<a href="https://github.com/dd-consulting">
     <img src="../reference/GZ_logo.png" width="60" align="right">
</a>
<h1>
    One-Stop Analytics: Exploratory Data Analysis (EDA)
</h1>

Case Study of Autism Spectrum Disorder (ASD) with R


[ United States ]

Centers for Disease Control and Prevention (CDC) - Autism Spectrum Disorder (ASD)

Autism spectrum disorder (ASD) is a developmental disability that can cause significant social, communication and behavioral challenges. CDC is committed to continuing to provide essential data on ASD, search for factors that put children at risk for ASD and possible causes, and develop resources that help identify children with ASD as early as possible.

https://www.cdc.gov/ncbddd/autism/data/index.html

[ Singapore ]

TODAY Online - More preschoolers diagnosed with developmental issues

Doctors cited better awareness among parents and preschool teachers, leading to early referrals for diagnosis.

https://www.gov.sg/news/content/today-online-more-preschoolers-diagnosed-with-developmental-issues

https://www.pathlight.org.sg/


Workshop Objective:

Use R to analyze Autism Spectrum Disorder (ASD) data from CDC USA.

https://www.cdc.gov/ncbddd/autism/data/index.html

  • EDA - Summarization

  • Data Visualisation (Enhanced)

  • Workshop Submission

  • Appendices


Obtain current R working directory

getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"

Set new R working directory

# setwd("/media/sf_vm_shared_folder/git/DDC/DDC-ASD/model_R")
# setwd('~/Desktop/admin-desktop/vm_shared_folder/git/DDC-ASD/model_R')
getwd()
## [1] "/media/sf_vm_shared_folder/git/DDC-ASD/model_R"

Read in CSV data, storing as R dataframe

# Read in the previously saved CSV file:
ASD_National <- read.csv("../dataset/ADV_ASD_National_R.csv")
# Convert Year_Factor to ordered.factor
ASD_National$Year_Factor <- factor(ASD_National$Year_Factor, ordered = TRUE) 
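
To illustrate what the ordered-factor conversion gives us, here is a minimal sketch using a made-up vector of years (an assumption for illustration, not read from the CSV). An ordered factor keeps its levels in sorted order and allows comparisons such as `<`.

```r
# Hypothetical survey years, for illustration only (not read from the CSV)
years <- c(2004, 2000, 2008, 2000)

# An ordered factor sorts its levels and supports order comparisons
year_factor <- factor(years, ordered = TRUE)

levels(year_factor)       # levels in ascending order
is.ordered(year_factor)   # TRUE for an ordered factor
year_factor < "2008"      # element-wise comparison against a level
```

With a plain (unordered) factor, the last comparison would return NA with a warning instead.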

EDA - Summarization

<h3>
EDA - Summarization - High Level Data Summary
</h3>
summary(ASD_National)
##   Source        Year        Prevalence        Upper.CI         Lower.CI     
##  addm: 8   Min.   :2000   Min.   : 1.800   Min.   : 1.800   Min.   : 1.700  
##  medi:13   1st Qu.:2004   1st Qu.: 3.950   1st Qu.: 3.950   1st Qu.: 3.875  
##  nsch: 4   Median :2008   Median : 6.650   Median : 6.900   Median : 6.350  
##  sped:17   Mean   :2007   Mean   : 7.952   Mean   : 8.207   Mean   : 7.712  
##            3rd Qu.:2011   3rd Qu.: 9.725   3rd Qu.:10.350   3rd Qu.: 9.625  
##            Max.   :2016   Max.   :29.200   Max.   :30.700   Max.   :27.700  
##                                                                             
##                                                  Source_Full1
##  Autism & Developmental Disabilities Monitoring Network: 8   
##  Medicaid                                              :13   
##  National Survey of Children's Health                  : 4   
##  Special Education Child Count                         :17   
##                                                              
##                                                              
##                                                              
##                                                       Source_Full2
##  addm-Autism & Developmental Disabilities Monitoring Network: 8   
##  medi-Medicaid                                              :13   
##  nsch-National Survey of Children's Health                  : 4   
##  sped-Special Education Child Count                         :17   
##                                                                   
##                                                                   
##                                                                   
##  Male.Prevalence Male.Lower.CI   Male.Upper.CI   Female.Prevalence
##  Min.   :11.50   Min.   :12.20   Min.   :13.70   Min.   :2.700    
##  1st Qu.:13.70   1st Qu.:14.85   1st Qu.:16.07   1st Qu.:3.050    
##  Median :18.40   Median :20.20   Median :21.55   Median :4.000    
##  Mean   :18.71   Mean   :19.22   Mean   :20.62   Mean   :4.271    
##  3rd Qu.:23.55   3rd Qu.:22.93   3rd Qu.:24.32   3rd Qu.:5.250    
##  Max.   :26.60   Max.   :25.80   Max.   :27.40   Max.   :6.600    
##  NA's   :35      NA's   :36      NA's   :36      NA's   :35       
##  Female.Lower.CI Female.Upper.CI Non.hispanic.white.Prevalence
##  Min.   :2.600   Min.   :3.300   Min.   : 7.70                
##  1st Qu.:3.100   1st Qu.:3.700   1st Qu.: 9.80                
##  Median :4.300   Median :4.950   Median :12.00                
##  Mean   :4.217   Mean   :4.900   Mean   :12.51                
##  3rd Qu.:4.975   3rd Qu.:5.675   3rd Qu.:15.55                
##  Max.   :6.200   Max.   :7.000   Max.   :17.20                
##  NA's   :36      NA's   :36      NA's   :35                   
##  Non.hispanic.white.Lower.CI Non.hispanic.white.Upper.CI
##  Min.   : 9.100              Min.   :10.40              
##  1st Qu.: 9.925              1st Qu.:10.93              
##  Median :13.100              Median :14.20              
##  Mean   :12.733              Mean   :13.88              
##  3rd Qu.:15.075              3rd Qu.:16.20              
##  Max.   :16.500              Max.   :17.80              
##  NA's   :36                  NA's   :36                 
##  Non.hispanic.black.Prevalence Non.hispanic.black.Lower.CI
##  Min.   : 6.50                 Min.   : 6.200             
##  1st Qu.: 7.05                 1st Qu.: 7.325             
##  Median :10.20                 Median :10.500             
##  Mean   :10.31                 Mean   :10.200             
##  3rd Qu.:12.70                 3rd Qu.:12.100             
##  Max.   :16.00                 Max.   :15.100             
##  NA's   :35                    NA's   :36                 
##  Non.hispanic.black.Upper.CI Hispanic.Prevalence Hispanic.Lower.CI
##  Min.   : 7.600              Min.   : 5.900      Min.   : 5.000   
##  1st Qu.: 8.575              1st Qu.: 6.625      1st Qu.: 5.775   
##  Median :12.000              Median : 9.000      Median : 8.300   
##  Mean   :11.700              Mean   : 9.150      Mean   : 8.333   
##  3rd Qu.:13.700              3rd Qu.:10.625      3rd Qu.: 9.850   
##  Max.   :16.900              Max.   :14.000      Max.   :13.100   
##  NA's   :36                  NA's   :36          NA's   :36       
##  Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
##  Min.   : 6.600    Min.   : 9.70                       
##  1st Qu.: 7.775    1st Qu.:10.97                       
##  Median : 9.750    Median :11.85                       
##  Mean   :10.017    Mean   :11.72                       
##  3rd Qu.:11.425    3rd Qu.:12.60                       
##  Max.   :14.900    Max.   :13.50                       
##  NA's   :36        NA's   :38                          
##  Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
##  Min.   : 8.10                      Min.   :11.60                     
##  1st Qu.: 9.45                      1st Qu.:12.72                     
##  Median :10.30                      Median :13.65                     
##  Mean   :10.12                      Mean   :13.57                     
##  3rd Qu.:10.97                      3rd Qu.:14.50                     
##  Max.   :11.80                      Max.   :15.40                     
##  NA's   :38                         NA's   :38                        
##  Source_UC                                                      Source_Full3
##  ADDM: 8   ADDM Autism & Developmental Disabilities Monitoring Network: 8   
##  MEDI:13   MEDI Medicaid                                              :13   
##  NSCH: 4   NSCH National Survey of Children's Health                  : 4   
##  SPED:17   SPED Special Education Child Count                         :17   
##                                                                             
##                                                                             
##                                                                             
##  Prevalence_Risk2  Prevalence_Risk4  Year_Factor
##  High:28          High     : 8      2004   : 4  
##  Low :14          Low      :14      2008   : 4  
##                   Medium   :18      2012   : 4  
##                   Very High: 2      2000   : 3  
##                                     2002   : 3  
##                                     2006   : 3  
##                                     (Other):21
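
The summary shows many NA's in the sex and ethnicity columns. A quick way to count missing values per column is `colSums(is.na(...))`. The sketch below uses a small stand-in dataframe (`addm_sample` is a hypothetical name) holding the first three ADDM rows as they appear in the ASD_National_ADDM printout later in this workshop; the same call should work on the full `ASD_National` dataframe.

```r
# First three ADDM rows (years 2000, 2002, 2004), a stand-in for ASD_National
addm_sample <- data.frame(
  Year            = c(2000, 2002, 2004),
  Prevalence      = c(6.7, 6.6, 8.0),
  Male.Prevalence = c(NA, 11.5, 12.9),
  Male.Lower.CI   = c(NA, NA, 12.2)
)

# Count missing values in each column
na_counts <- colSums(is.na(addm_sample))
na_counts

# Share of rows that are missing, per column
round(na_counts / nrow(addm_sample), 2)
```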

Data Visualisation (Enhanced)

if(!require(ggplot2)){install.packages("ggplot2")}
## Loading required package: ggplot2
library(ggplot2)
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] Explore the Data</span>
</h3>

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span>
</h3>
# ----------------------------------
# [National] < Years Data Available >
# ----------------------------------
p = ggplot(ASD_National, aes(x = 1, fill = Source)) + 
  geom_bar() + theme(axis.text.x=element_blank(),  # Hide axis
                     axis.ticks.x=element_blank(), # Hide axis
                     axis.text.y=element_blank(),  # Hide axis
                     axis.ticks.y=element_blank(), # Hide axis
                     panel.background = element_blank(), # Remove panel background
                     legend.position="top"
  ) + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  labs(x="", y="", title="Years Data Available") + # layers of graphics
  facet_grid(facets = Source~Year)
# Show plot
p

<h3>
Data Visualisation (Enhanced) - Barplot
</h3>
# Create bar chart using R graphics
barplot(table(ASD_National$Source))

# Create bar chart using ggplot2
ggplot(ASD_National, aes(x = Source)) + geom_bar(fill = "blue", alpha=0.5)
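
Both bar charts above plot the number of rows per data source. As a rough sketch of the counts behind the bars, the code below rebuilds a stand-in `Source` column (`source_col` is a hypothetical name) from the counts reported by `summary(ASD_National)` and tallies it with `table()`.

```r
# Rebuild the Source column from the counts reported by summary(ASD_National)
source_col <- rep(c("addm", "medi", "nsch", "sped"), times = c(8, 13, 4, 17))

# table() tallies rows per category; these are the bar heights
counts <- table(source_col)
counts

# Proportion of all rows contributed by each source
round(prop.table(counts), 2)
```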

# Use color to differentiate sub-group data (Year)
ggplot(ASD_National, aes(x = Source, fill = factor(Year))) + geom_bar() + 
  theme(legend.position="top") + labs(fill = "Legend: Year")

# Split chart into multiple columns by using: facets = . ~ Year
ggplot(ASD_National, aes(x = Source, fill = Source)) + geom_bar() + 
  theme(legend.position="top") + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  facet_grid(facets = . ~ Year)

# Split chart into multiple rows and columns by using: facets = Source ~ Year
ggplot(ASD_National, aes(x = Source, fill = Source)) + geom_bar() + 
  theme(legend.position="top") + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  facet_grid(facets = Source~Year)

The chart above is now very similar to the earlier [National] < Years Data Available > chart.

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Prevalence by Data Sources and Risk Levels</span>
</h3>
# Use color to differentiate sub-group data (Risk Level)
ggplot(ASD_National, aes(x = Source, fill = Prevalence_Risk4)) + 
  geom_bar(alpha=0.95, position = position_stack(reverse = TRUE)) + # Reverse default stacking order
  # The scale name is the legend title (it takes precedence over labs(fill = ...))
  scale_fill_manual("Legend: Risk", values = c("Low" = "lightyellow", 
                                               "Medium" = "orange", 
                                               "High" = "red",
                                               "Very High" = "darkred")) +
  labs(x="Data Sources", y="Occurrences", title="Prevalence by Data Sources and Risk Levels") + # Axis labels and title
  theme(legend.position="top")

Barplot / Column plot

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY SEX</span>
</h3>

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] [ Year: 2014 ]
</h3>
# Keep only the rows from the ADDM data source
ASD_National_ADDM <- subset(ASD_National, Source == 'addm')
# Inspect the filtered dataframe
ASD_National_ADDM
##   Source Year Prevalence Upper.CI Lower.CI
## 1   addm 2000        6.7      7.0      6.3
## 2   addm 2002        6.6      6.8      6.3
## 3   addm 2004        8.0      8.4      7.6
## 4   addm 2006        9.0      9.3      8.6
## 5   addm 2008       11.3     11.7     11.0
## 6   addm 2010       14.7     15.1     14.3
## 7   addm 2012       14.8     15.2     14.4
## 8   addm 2014       16.8     17.3     16.4
##                                             Source_Full1
## 1 Autism & Developmental Disabilities Monitoring Network
## 2 Autism & Developmental Disabilities Monitoring Network
## 3 Autism & Developmental Disabilities Monitoring Network
## 4 Autism & Developmental Disabilities Monitoring Network
## 5 Autism & Developmental Disabilities Monitoring Network
## 6 Autism & Developmental Disabilities Monitoring Network
## 7 Autism & Developmental Disabilities Monitoring Network
## 8 Autism & Developmental Disabilities Monitoring Network
##                                                  Source_Full2 Male.Prevalence
## 1 addm-Autism & Developmental Disabilities Monitoring Network              NA
## 2 addm-Autism & Developmental Disabilities Monitoring Network            11.5
## 3 addm-Autism & Developmental Disabilities Monitoring Network            12.9
## 4 addm-Autism & Developmental Disabilities Monitoring Network            14.5
## 5 addm-Autism & Developmental Disabilities Monitoring Network            18.4
## 6 addm-Autism & Developmental Disabilities Monitoring Network            23.7
## 7 addm-Autism & Developmental Disabilities Monitoring Network            23.4
## 8 addm-Autism & Developmental Disabilities Monitoring Network            26.6
##   Male.Lower.CI Male.Upper.CI Female.Prevalence Female.Lower.CI Female.Upper.CI
## 1            NA            NA                NA              NA              NA
## 2            NA            NA               2.7              NA              NA
## 3          12.2          13.7               2.9             2.6             3.3
## 4          13.9          15.1               3.2             2.9             3.5
## 5          17.7          19.0               4.0             3.7             4.3
## 6          23.0          24.4               5.3             5.0             5.7
## 7          22.7          24.1               5.2             4.9             5.6
## 8          25.8          27.4               6.6             6.2             7.0
##   Non.hispanic.white.Prevalence Non.hispanic.white.Lower.CI
## 1                            NA                          NA
## 2                           7.7                          NA
## 3                           9.7                         9.1
## 4                           9.9                         9.4
## 5                          12.0                        11.5
## 6                          15.8                        15.2
## 7                          15.3                        14.7
## 8                          17.2                        16.5
##   Non.hispanic.white.Upper.CI Non.hispanic.black.Prevalence
## 1                          NA                            NA
## 2                          NA                           6.5
## 3                        10.4                           6.9
## 4                        10.4                           7.2
## 5                        12.5                          10.2
## 6                        16.3                          12.3
## 7                        15.9                          13.1
## 8                        17.8                          16.0
##   Non.hispanic.black.Lower.CI Non.hispanic.black.Upper.CI Hispanic.Prevalence
## 1                          NA                          NA                  NA
## 2                          NA                          NA                  NA
## 3                         6.2                         7.6                 6.2
## 4                         6.6                         7.8                 5.9
## 5                         9.5                        10.9                 7.9
## 6                        11.5                        13.1                10.8
## 7                        12.3                        13.9                10.1
## 8                        15.1                        16.9                14.0
##   Hispanic.Lower.CI Hispanic.Upper.CI Asian.or.Pacific.Islander.Prevalence
## 1                NA                NA                                   NA
## 2                NA                NA                                   NA
## 3               5.0               7.5                                   NA
## 4               5.3               6.6                                   NA
## 5               7.2               8.6                                  9.7
## 6              10.0              11.6                                 12.3
## 7               9.4              10.9                                 11.4
## 8              13.1              14.9                                 13.5
##   Asian.or.Pacific.Islander.Lower.CI Asian.or.Pacific.Islander.Upper.CI
## 1                                 NA                                 NA
## 2                                 NA                                 NA
## 3                                 NA                                 NA
## 4                                 NA                                 NA
## 5                                8.1                               11.6
## 6                               10.7                               14.2
## 7                                9.9                               13.1
## 8                               11.8                               15.4
##   Source_UC                                                Source_Full3
## 1      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 2      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 3      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 4      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 5      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 6      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 7      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
## 8      ADDM ADDM Autism & Developmental Disabilities Monitoring Network
##   Prevalence_Risk2 Prevalence_Risk4 Year_Factor
## 1             High           Medium        2000
## 2             High           Medium        2002
## 3             High           Medium        2004
## 4             High           Medium        2006
## 5             High             High        2008
## 6             High             High        2010
## 7             High             High        2012
## 8             High             High        2014
# Construct a new re-shaped dataframe of [ Source: ADDM ] [Year: 2014]
#
Process_Source = 'addm'
Process_Year = 2014

Define a function to create a re-shaped dataframe:

Function_Reshape_ASD_National_ADDM <- function(Process_Source, Process_Year) {
    # Note: this function reads the global dataframe ASD_National_ADDM;
    # Process_Source is only used to label the output rows.

    # Create the vectors:
    Sex.Group  = c('Overall', 
                   'Boys', 
                   'Girls')

    Prevalence = c(ASD_National_ADDM$Prevalence[ASD_National_ADDM$Year == Process_Year],
                   ASD_National_ADDM$Male.Prevalence[ASD_National_ADDM$Year == Process_Year],
                   ASD_National_ADDM$Female.Prevalence[ASD_National_ADDM$Year == Process_Year])

    # Combine all the vectors into a data frame:
    ASD_National_ADDM_Reshaped_DF = data.frame(Sex.Group, Prevalence, stringsAsFactors=TRUE)

    # Add new columns:
    ASD_National_ADDM_Reshaped_DF$Source = Process_Source
    ASD_National_ADDM_Reshaped_DF$Year = Process_Year
    return(ASD_National_ADDM_Reshaped_DF) # Return a dataframe
}

Use defined function Function_Reshape_ASD_National_ADDM( ) for a specific year:

ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2014)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall       16.8   addm 2014
## 2      Boys       26.6   addm 2014
## 3     Girls        6.6   addm 2014

Visualise: Prevalence Estimates by Sex [ Source: ADDM ] [ Year: 2014 ]

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=3)
ggplot(ASD_National_ADDM_Reshaped_DF, aes(Sex.Group, Prevalence)) +
  geom_col(aes(fill = Sex.Group), alpha=0.5) + # Use column chart
  geom_text(aes(label = Prevalence), vjust = +0.75, hjust = -0.2, size = 3) +
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "") +
  scale_fill_manual("Sex Group:", values = c("Overall" = "purple", 
                                             "Boys" = "blue",
                                             "Girls" = "orange")) + 
  ggtitle("Prevalence Estimates by Sex [ Source: ADDM ] [ Year: 2014 ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        legend.position = 'none') + 
  coord_flip()  # Rotate chart

#  facet_grid(facets = Year ~ .)
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] [ Year: ALL ]
</h3>
# Create a new dataframe to hold re-shaped data for all years.
ASD_National_ADDM_Reshaped_DF_All = ASD_National_ADDM_Reshaped_DF # Loaded with initial [ Year: 2014 ] data
Process_Source = 'addm'
unique(ASD_National_ADDM$Year)
## [1] 2000 2002 2004 2006 2008 2010 2012 2014

Use defined function Function_Reshape_ASD_National_ADDM( ) for ALL remaining years:

ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2012)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall       14.8   addm 2012
## 2      Boys       23.4   addm 2012
## 3     Girls        5.2   addm 2012
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2010)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall       14.7   addm 2010
## 2      Boys       23.7   addm 2010
## 3     Girls        5.3   addm 2010
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2008)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall       11.3   addm 2008
## 2      Boys       18.4   addm 2008
## 3     Girls        4.0   addm 2008
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2006)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall        9.0   addm 2006
## 2      Boys       14.5   addm 2006
## 3     Girls        3.2   addm 2006
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2004)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall        8.0   addm 2004
## 2      Boys       12.9   addm 2004
## 3     Girls        2.9   addm 2004
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2002)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall        6.6   addm 2002
## 2      Boys       11.5   addm 2002
## 3     Girls        2.7   addm 2002
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
ASD_National_ADDM_Reshaped_DF <- Function_Reshape_ASD_National_ADDM(Process_Source = 'addm', Process_Year = 2000)
ASD_National_ADDM_Reshaped_DF
##   Sex.Group Prevalence Source Year
## 1   Overall        6.7   addm 2000
## 2      Boys         NA   addm 2000
## 3     Girls         NA   addm 2000
# Append rows to existing dataframe, using Row Bind function: rbind()
ASD_National_ADDM_Reshaped_DF_All = rbind(ASD_National_ADDM_Reshaped_DF_All, ASD_National_ADDM_Reshaped_DF)
# Re-shaped ADDM data for ALL years:
ASD_National_ADDM_Reshaped_DF_All
##    Sex.Group Prevalence Source Year
## 1    Overall       16.8   addm 2014
## 2       Boys       26.6   addm 2014
## 3      Girls        6.6   addm 2014
## 4    Overall       14.8   addm 2012
## 5       Boys       23.4   addm 2012
## 6      Girls        5.2   addm 2012
## 7    Overall       14.7   addm 2010
## 8       Boys       23.7   addm 2010
## 9      Girls        5.3   addm 2010
## 10   Overall       11.3   addm 2008
## 11      Boys       18.4   addm 2008
## 12     Girls        4.0   addm 2008
## 13   Overall        9.0   addm 2006
## 14      Boys       14.5   addm 2006
## 15     Girls        3.2   addm 2006
## 16   Overall        8.0   addm 2004
## 17      Boys       12.9   addm 2004
## 18     Girls        2.9   addm 2004
## 19   Overall        6.6   addm 2002
## 20      Boys       11.5   addm 2002
## 21     Girls        2.7   addm 2002
## 22   Overall        6.7   addm 2000
## 23      Boys         NA   addm 2000
## 24     Girls         NA   addm 2000
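
The year-by-year calls above can also be written as a loop. The following is a minimal sketch, not the workshop's own method: `addm` is a stand-in dataframe with values copied from the ASD_National_ADDM printout, and `reshape_one_year()` is a hypothetical helper that takes the dataframe as an argument instead of reading a global variable.

```r
# ADDM values copied from the ASD_National_ADDM printout
addm <- data.frame(
  Year              = c(2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014),
  Prevalence        = c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8),
  Male.Prevalence   = c(NA, 11.5, 12.9, 14.5, 18.4, 23.7, 23.4, 26.6),
  Female.Prevalence = c(NA, 2.7, 2.9, 3.2, 4.0, 5.3, 5.2, 6.6)
)

# Hypothetical helper: reshape one year into three rows (Overall/Boys/Girls)
reshape_one_year <- function(df, year, source = "addm") {
  row <- df[df$Year == year, ]
  data.frame(
    Sex.Group  = c("Overall", "Boys", "Girls"),
    Prevalence = c(row$Prevalence, row$Male.Prevalence, row$Female.Prevalence),
    Source     = source,
    Year       = year,
    stringsAsFactors = TRUE
  )
}

# Process years from newest to oldest, then stack the pieces with rbind()
years <- sort(unique(addm$Year), decreasing = TRUE)
reshaped_all <- do.call(rbind, lapply(years, reshape_one_year, df = addm))
reshaped_all
```

This should produce the same 24 rows as the table above, in the same order, without repeating the call for each year.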

Visualise: Prevalence Estimates by Sex [ Source: ADDM ] [ Year: ALL ]

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
ggplot(ASD_National_ADDM_Reshaped_DF_All, aes(Sex.Group, Prevalence)) +
  geom_col(aes(fill = Sex.Group), alpha=0.75) + # Use column chart
  geom_text(aes(label = Prevalence), vjust = +0.5, hjust = -0.2, size = 2.5) +
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "") +
  scale_fill_manual("Sex Group:", values = c("Overall" = "purple", 
                                             "Boys" = "blue",
                                             "Girls" = "orange")) + 
  ggtitle("Prevalence Estimates by Sex [ Source: ADDM ] [ Year: ALL ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        legend.position = 'none') + 
  coord_flip() + # Rotate chart
  facet_grid(facets = Year ~ .)
## Warning: Removed 2 rows containing missing values (position_stack).
## Warning: Removed 2 rows containing missing values (geom_text).
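
The two warnings come from the year 2000 rows, where prevalence for Boys and Girls is NA. One option (a sketch, assuming it is acceptable to drop those rows) is to filter out missing values before plotting. `df_2000` below is a hypothetical stand-in holding the three year-2000 rows from the re-shaped table.

```r
# Three rows for year 2000, copied from the re-shaped table
df_2000 <- data.frame(
  Sex.Group  = c("Overall", "Boys", "Girls"),
  Prevalence = c(6.7, NA, NA),
  Year       = 2000
)

# Keep only rows with a non-missing Prevalence before plotting
df_complete <- df_2000[!is.na(df_2000$Prevalence), ]
df_complete

# Number of rows dropped
nrow(df_2000) - nrow(df_complete)
```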

<h3>
Data Visualisation (Enhanced) - Histogram (distribution of binned continuous variable)
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Create histogram using R graphics
hist(ASD_National$Prevalence)

# Create histogram using ggplot2
ggplot(ASD_National, aes(x=Prevalence)) + 
  geom_histogram(binwidth = 5, fill = "blue", color = "lightgrey", alpha=0.5)

# Use color to differentiate sub-group data (Data Source)
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
  geom_histogram(binwidth = 5) +
  theme_bw() + theme(legend.position="right") +
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue"))

# Plot sub-group data side by side, using position="dodge"
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
  geom_histogram(binwidth = 5, position="dodge") +
  theme_bw() + theme(legend.position="right") +
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue"))

# Split plots using facet_grid()
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
  geom_histogram(binwidth = 5) +
  theme(legend.position="right") + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  facet_grid(facets = Source ~ .)

# Add title and caption using ggplot2
ggplot(ASD_National, aes(x=Prevalence, fill = Source)) +
  geom_histogram(binwidth = 5) +
  theme(legend.position="top") + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) + 
  labs(x="Prevalence per 1,000 Children",
       y="Frequency",
       title="Distribution of Prevalence by Data Source") +
  facet_grid(facets = Source ~ .)
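
To make the idea of binning concrete, here is a small sketch that counts values per bin by hand with `cut()` and `table()`. It uses only the eight ADDM prevalence values from the ASD_National_ADDM printout, and bin edges at multiples of 5 (an assumption; the default bin alignment of `geom_histogram()` may differ, so the counts need not match the ggplot2 charts exactly).

```r
# ADDM prevalence values copied from the ASD_National_ADDM printout
prevalence <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)

# Bin edges every 5 units; intervals are right-closed, e.g. (5,10]
bins <- cut(prevalence, breaks = seq(0, 30, by = 5))

# Frequency of observations per bin
bin_counts <- table(bins)
bin_counts
```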

<h3>
Data Visualisation (Enhanced) - Density plot (distribution for continuous variable normalized to 100% area under curve)
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# Create plot using R graphics
plot(density(ASD_National$Prevalence))
# Optionally, overlay histogram
hist(ASD_National$Prevalence, probability = TRUE, add = TRUE)

# Create plot using ggplot2
p <- ggplot(ASD_National) +
  geom_density(aes(x=Prevalence), fill = "grey", color = "white", alpha=0.75) 
p # Show

# Optionally, overlay histogram
# (after_stat() needs ggplot2 >= 3.3.0; older versions use y = ..density..)
p <- p + geom_histogram(aes(x = Prevalence, y = after_stat(density)), binwidth = 1, fill = "blue", colour = "lightgrey", alpha=0.4) 
p # Show

# Optionally, overlay Prevalence mean
p <- p + geom_vline(xintercept = mean(ASD_National$Prevalence), colour="darkorange")
p # Show

# Lastly, add other captions
p <- p + coord_cartesian(xlim=c(0, 35), ylim=c(0, 0.2)) +
  labs(x="Prevalence per 1,000 Children", y="Density", 
       title=paste("Density of Prevalence ( mean =", round(mean(ASD_National$Prevalence), 2), ")")) +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))
p # Show
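
A density curve is scaled so that the area under it is (approximately) 1, that is, 100%. The sketch below checks this numerically with the trapezoidal rule, using only the eight ADDM prevalence values from the ASD_National_ADDM printout as sample data.

```r
# ADDM prevalence values copied from the ASD_National_ADDM printout
prevalence <- c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.8, 16.8)

# Kernel density estimate: d$x is the grid, d$y the estimated density
d <- density(prevalence)

# Trapezoidal rule: sum of (grid step) x (average height of neighbouring points)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
area   # expected to be close to 1
```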

< Prevalence distribution by Data Source >

# Prevalence distribution by Data Source
ggplot(ASD_National) + geom_density(aes(x = Prevalence, fill = Source), alpha = 0.5) + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  labs(x="Prevalence per 1,000 Children",
       y="Density",
       title="Density of Prevalence by Data Source") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))

< Prevalence distribution by Data Source with split >

# Prevalence distribution by Data Source with split
ggplot(ASD_National) + geom_density(aes(x = Prevalence, fill = Source), colour = 'lightgrey', alpha = 0.75) + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) + 
  labs(x="Prevalence per 1,000 Children",
       y="Density",
       title="Density of Prevalence by Data Source") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey")) + 
  facet_wrap(~Source)

<h3>
Data Visualisation (Enhanced) - Box plot
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Create plot using R graphics
# Create 'Prevalence' box plots broken down by 'Source'
boxplot(ASD_National$Prevalence ~ ASD_National$Source,
        main = "National ASD Prevalence by Data Source",
        xlab = "Data Source",
        ylab = "Prevalence per 1,000 Children",
        sub  = "Year 2000 - 2016",
        col.main="blue", col.lab="black", col.sub="darkgrey")

# Create box plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) + 
  geom_boxplot(alpha = 0.5) + 
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
  ggtitle("National ASD Prevalence by Data Source") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))
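
To read the box plots above, it can help to see the numbers behind each box. This is a small sketch in base R on a made-up data frame (`toy` and its values are assumptions for illustration). It computes the five-number summary per group and the statistics that `boxplot()` draws.

```r
# Made-up data with two sources (illustrative only)
set.seed(1)
toy <- data.frame(
  Source = rep(c("addm", "sped"), each = 50),
  Prevalence = c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 6, sd = 1))
)

# Five-number summary (min, lower hinge, median, upper hinge, max) per source
five_num <- tapply(toy$Prevalence, toy$Source, fivenum)
five_num

# boxplot.stats() returns what the box plot draws for one group:
# $stats = lower whisker, lower hinge, median, upper hinge, upper whisker
# $out   = points beyond the whiskers, drawn individually
stats_addm <- boxplot.stats(toy$Prevalence[toy$Source == "addm"])
stats_addm$stats
stats_addm$out
```

The hinges and median from `boxplot.stats()` match the middle three values of `fivenum()`. Only the whisker ends can differ, because whiskers stop at the most extreme point within 1.5 times the hinge spread.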

<h3>
Data Visualisation (Enhanced) - Violin plot
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# Create plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) + 
  geom_violin(alpha = 0.5) + 
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
  ggtitle("National ASD Prevalence by Data Source") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))

# Create plot using ggplot2
ggplot(ASD_National, aes(x = Source, y = Prevalence, fill = Source)) + 
  geom_violin(alpha = 0.5) + 
  geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1)) + # Overlay datapoints
#  coord_flip() + # Uncomment to flip x-y axis
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "Data Source (Year 2000 - 2016)") +
  ggtitle("National ASD Prevalence by Data Source") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))

<h3>
Data Visualisation (Enhanced) - Line chart
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span>
</h3>

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE HAS CHANGED OVER TIME</span> [Source: ALL]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# Build chart/plot layer by layer
# ----------------------------------

# Define a ggplot graphic object; provide data and x y for use
p <- ggplot(ASD_National, aes(x = Year, y = Prevalence))
# Show plot
p

# Select (add) line chart type:
p <- p + geom_line(aes(color = Source),
                   linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
                   size=1,
                   alpha=0.5) 
# Show plot
p

# Select (add) points to chart:
p <- p + geom_point(aes(color = Source),
                    size=2, 
                    shape=20,
                    alpha=0.5) 
# Show plot
p

# Customize line color and legend name:
p <- p + scale_color_manual("Data Source:", 
                            labels = c('ADDM', 'MEDI', 'NSCH', 'SPED'),
                            values = c("addm" = "darkblue", 
                                       "medi" = "orange", 
                                       "nsch" = "darkred",
                                       "sped" = "skyblue"))
# Show plot
p

# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(0, 30, 5),
                            limits=c(0, 30)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) 
# Show plot
p

# Customise chart title:
p <- p + ggtitle("Prevalence Estimates Over Time [ Source: ALL ]") 
# Show plot
p

# Customise chart title and axis labels:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
               axis.title = element_text(face = 'plain', color = "darkslategrey")) 
# Show plot
p

Consolidate above code into one chunk:

# ----------------------------------
# Consolidate above code into one chunk
# ----------------------------------
p <- ggplot(ASD_National, aes(x = Year, y = Prevalence)) +
  geom_line(aes(color = Source),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(color = Source),
             size=2, 
             shape=20,
             alpha=0.5) + 
  scale_color_manual("Data Source:", 
                     labels = c('ADDM', 'MEDI', 'NSCH', 'SPED'),
                     values = c("addm" = "darkblue", 
                                "medi" = "orange", 
                                "nsch" = "darkred",
                                "sped" = "skyblue")) +
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) +
  ggtitle("Prevalence Estimates Over Time [ Source: ALL ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"))
# Show plot
p

Optionally, display data values/labels:

# Optionally, display data values/labels
p + geom_text(aes(label = round(Prevalence, 1)), # Values are rounded for display
              vjust = "outward", 
              #          nudge_y = 0.2, # optionally lift the text
              hjust = "outward", 
              check_overlap = TRUE,
              size = 3, # size of textual data label
              col = 'darkslategrey')

<h3>
Data Visualisation (Enhanced) - Dynamic Visualisation with plotly
</h3>
if(!require(knitr)){install.packages("knitr")}
## Loading required package: knitr
library("knitr")
if(!require(plotly)){install.packages("plotly")}
## Loading required package: plotly
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library("plotly")

Create plotly graph object from ggplot graph object:

p_dynamic <- p
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - Use themes as aesthetic template
</h3>
if(!require(ggthemes)){install.packages("ggthemes")}
## Loading required package: ggthemes
library('ggthemes')

Theme of the Economist magazine:

# Theme of the Economist magazine:
p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

Theme of the Wall Street Journal:

# Theme of the Wall Street Journal:
p + theme_wsj() + scale_colour_wsj("colors6")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.

Dynamic chart with theme of the Economist magazine:

# Dynamic chart with theme of the Economist magazine:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] ADDM Network estimates for overall ASD prevalence in US over time</span> [ Source: ADDM ] over [ Year ]
</h3>

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] ADDM Network estimates for overall ASD prevalence in US over time</span> [ Source: ADDM ] over [ Year ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)
# Filter only data of ADDM
ASD_National_ADDM <- subset(ASD_National, Source == 'addm')
# ----------------------------------
# [addm] ADDM Network estimates for overall ASD prevalence in US over time
# ----------------------------------

#  Color:
# 'ADDM_Average' "purple"

p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
  geom_point(aes(y = Prevalence, color = 'ADDM_Average'), # Name for manual colour mapping
             size=2, 
             shape=20,
             alpha=0.95) +
  # Add point for Upper.CI
  geom_point(aes(y = Upper.CI, color = 'ADDM_U_CI'), # Name for manual colour mapping
             size=0.1, 
             shape=20,
             alpha=0.95) +
  # Add point for Lower.CI
  geom_point(aes(y = Lower.CI, color = 'ADDM_L_CI'), # Name for manual colour mapping
             size=0.1, 
             shape=20,
             alpha=0.95) +
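  scale_colour_manual(name="",
                      breaks = c("ADDM_Average", "ADDM_U_CI", "ADDM_L_CI"), # Fix legend order; default is alphabetical, which would swap the CI labels
                      labels = c("US (ADDM)", "Upper CI", "Lower CI"), # Names shown in legend, matched to breaks 
                      values = c(ADDM_Average="purple", ADDM_U_CI="red", ADDM_L_CI="red")) # Manual colour mapping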
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(0, 18, 2),
                            limits=c(0, 18)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2014, 2), 
                     limits = c(2000, 2014)) +
  ggtitle("ADDM Network estimates for overall ASD prevalence in US over time\nwith confidence interval") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        panel.background = element_blank(), # Remove chart background colour
        legend.position = 'top',
        panel.grid.major = element_line(size = 0.2, linetype = 'solid', colour = "lightgrey") # grid colour et al
       )
# Show plot
p

# Add smooth curve to go through data points, using interpolation with splines:
# https://stackoverflow.com/questions/35205795/plotting-smooth-line-through-all-data-points
spline_ADDM_Prevalence <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Prevalence))
spline_ADDM_Prevalence_U_CI <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Upper.CI))
spline_ADDM_Prevalence_L_CI <- as.data.frame(spline(ASD_National_ADDM$Year, ASD_National_ADDM$Lower.CI))
# Show plot
p + geom_line(data = spline_ADDM_Prevalence, aes(x = x, y = y, color = 'ADDM_Average'), linetype = "solid", size=0.6) + 
  geom_line(data = spline_ADDM_Prevalence_U_CI, aes(x = x, y = y, color = 'ADDM_U_CI'), linetype = 2, size=0.3) +
  geom_line(data = spline_ADDM_Prevalence_L_CI, aes(x = x, y = y, color = 'ADDM_L_CI'), linetype = 2, size=0.3)
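
The spline step above relies on `spline()` interpolating, that is, producing a curve that passes exactly through the supplied points. A minimal sketch with made-up numbers (the `year` and `prev` vectors are assumptions, not ADDM estimates) shows the shape of the output and checks that property.

```r
# Made-up biennial series (illustrative values only)
year <- seq(2000, 2014, by = 2)
prev <- c(1, 3, 2, 5, 4, 6, 8, 7)

# Dense interpolated curve: n equally spaced x values between min and max year
smooth_curve <- as.data.frame(spline(year, prev, n = 100))
head(smooth_curve)   # columns x and y, ready for geom_line(aes(x = x, y = y))

# Evaluating the spline at the original years returns the original values
at_knots <- spline(year, prev, xout = year)$y
all.equal(at_knots, prev)
```

Because the curve is forced through every point, it can overshoot between points when the series zig-zags. A smoother such as `smooth.spline()` or `loess()` trades exactness for a calmer line, so the choice depends on whether the points should be honoured exactly.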

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY SEX</span> [ Source: ADDM ] over [ Year ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [addm] < Prevalence Varies by Sex >
# ----------------------------------

#  Color:
# 'ADDM_Average' "darkslategrey"
# 'Female_Prevalence' "orange"
# 'Male_Prevalence' "blue"

p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
  geom_line(aes(y = Prevalence, colour = 'ADDM_Average'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Prevalence, color = 'ADDM_Average'),
             size=2, 
             shape=20,
             alpha=0.5) +
  # Add line for Female
  geom_line(aes(y = Female.Prevalence, colour = 'Female_Prevalence'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Female.Prevalence, color = 'Female_Prevalence'),
             size=2, 
             shape=20,
             alpha=0.5) +
  # Add line for Male
  geom_line(aes(y = Male.Prevalence, colour = 'Male_Prevalence'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Male.Prevalence, color = 'Male_Prevalence'),
             size=2, 
             shape=20,
             alpha=0.5) +
  scale_colour_manual(name="",
                      labels = c("ADDM Average", "Female Prevalence", "Male Prevalence"),
                      values = c(ADDM_Average="darkslategrey", Female_Prevalence="orange", Male_Prevalence="blue"))
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(0, 30, 5),
                            limits=c(0, 30)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) +
  ggtitle("Prevalence Estimates by Sex [ Source: ADDM ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey")) 
# Show plot
p
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

# Apply theme
p + theme_economist() + scale_colour_economist() # p + theme_wsj() + scale_colour_wsj("colors6")
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

# Dynamic chart:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
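
The plot above adds one `geom_line()`/`geom_point()` pair per series. An alternative, sketched here in base R with made-up numbers (the `wide` data frame and its values are assumptions), is to reshape the columns into long form so that a single colour mapping can drive all series.

```r
# Made-up wide table: one column per series (illustrative values only)
wide <- data.frame(
  Year = c(2000, 2002, 2004),
  Female.Prevalence = c(1, 2, 3),
  Male.Prevalence = c(4, 5, 6)
)

# stack() turns the selected columns into two columns: values and ind (the source column name)
long <- data.frame(
  Year = rep(wide$Year, times = 2),
  stack(wide[c("Female.Prevalence", "Male.Prevalence")])
)
names(long) <- c("Year", "Prevalence", "Group")
long

# With ggplot2 loaded, one mapping would then cover both series (not run here):
# ggplot(long, aes(x = Year, y = Prevalence, colour = Group)) + geom_line() + geom_point()
```

The layer-per-series style used in this workshop keeps full control over each line, while the long form scales better as the number of groups grows.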
<h3>
    Quiz:
</h3>
<p>
    Add the 95% confidence interval to the plot above (use ggplot).
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span>
</h3>

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY RACE AND ETHNICITY</span> [ Source: ADDM ] With Average
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)
# ----------------------------------
# [addm] < Prevalence Varies by Race and Ethnicity >
# ----------------------------------

#  Color:
# 'ADDM_Average' "darkslategrey"
# 'Asian_Pacific_Islander' "darkred"
# 'Hispanic' "darkorchid3"
# 'Non_Hispanic_Black' "deepskyblue3"
# 'Non_Hispanic_White' "chartreuse3"

p <- ggplot(ASD_National_ADDM, aes(x = Year, y = Prevalence)) +
  geom_line(aes(y = Prevalence, colour = 'ADDM_Average'),
            linetype = "dotted",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Prevalence, color = 'ADDM_Average'),
             size=2, 
             shape=20,
             alpha=0) +
  # Add line for Asian.or.Pacific.Islander.Prevalence
  geom_line(aes(y = Asian.or.Pacific.Islander.Prevalence, colour = 'Asian_Pacific_Islander'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Asian.or.Pacific.Islander.Prevalence, colour = 'Asian_Pacific_Islander'),
             size=2, 
             shape=20,
             alpha=0.5) +
  # Add line for Hispanic.Prevalence
  geom_line(aes(y = Hispanic.Prevalence, colour = 'Hispanic'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Hispanic.Prevalence, colour = 'Hispanic'),
             size=2, 
             shape=20,
             alpha=0.5) +
  # Add line for Non.hispanic.black.Prevalence
  geom_line(aes(y = Non.hispanic.black.Prevalence, colour = 'Non_Hispanic_Black'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Non.hispanic.black.Prevalence, colour = 'Non_Hispanic_Black'),
             size=2, 
             shape=20,
             alpha=0.5) +
  # Add line for Non.hispanic.white.Prevalence
  geom_line(aes(y = Non.hispanic.white.Prevalence, colour = 'Non_Hispanic_White'),
            linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
            size=1,
            alpha=0.5) +
  geom_point(aes(y = Non.hispanic.white.Prevalence, colour = 'Non_Hispanic_White'),
             size=2, 
             shape=20,
             alpha=0.5) +
  scale_colour_manual(name="",
                      labels = c("ADDM Average", 
                                 "Asian/Pacific Islander", 
                                 "Hispanic", 
                                 "Non-Hispanic Black", 
                                 "Non-Hispanic White"),
                      values = c(ADDM_Average="darkslategrey", 
                                 Asian_Pacific_Islander ="darkred", 
                                 Hispanic ="darkorchid3", 
                                 Non_Hispanic_Black ="deepskyblue3", 
                                 Non_Hispanic_White ="chartreuse3"))
# Add title, axis label, and axis scale
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(5, 20, 5),
                            limits=c(5, 20)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) +
  ggtitle("Prevalence Estimates by Race/Ethnicity [ Source: ADDM ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey")) 
# Show plot
p
## Warning: Removed 4 rows containing missing values (geom_path).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing missing values (geom_path).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing missing values (geom_path).
## Warning: Removed 1 rows containing missing values (geom_point).

# Apply theme
# p + theme_economist() + scale_colour_economist() # p + theme_wsj() + scale_colour_wsj("colors6")
# Dynamic chart:
p_dynamic <- p + theme_economist() + scale_colour_economist()
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
    Quiz:
</h3>
<p>
    Change the zig-zag lines above to spline/smooth lines.
</p>
<p>
    Hint: Refer to <span style="color:blue">ADDM Network estimates for overall ASD prevalence in US over time</span>.
</p>
# Write your code below and press Shift+Enter to execute 

Double-click here for the solution.

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US State-Level Data Processing</span>
</h3>
# ----------------------------------
# Dataset: US State-Level Children ASD Prevalence
# ----------------------------------

ASD_State    <- read.csv("../dataset/ADV_ASD_State.csv", stringsAsFactors = FALSE)

# Obtain number of rows and number of columns/features/variables
dim(ASD_State)
## [1] 1692   49
# Obtain overview (data structure/types)
str(ASD_State)
## 'data.frame':    1692 obs. of  49 variables:
##  $ State                                 : chr  "AZ" "GA" "MD" "NJ" ...
##  $ Denominator                           : int  45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
##  $ Prevalence                            : num  6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
##  $ Lower.CI                              : num  5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
##  $ Upper.CI                              : num  7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
##  $ Year                                  : int  2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
##  $ Source                                : chr  "addm" "addm" "addm" "addm" ...
##  $ Source_Full1                          : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ State_Full1                           : chr  "Arizona" "Georgia" "Maryland" "New Jersey" ...
##  $ State_Full2                           : chr  "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
##  $ Numerator_ASD                         : int  295 283 118 294 155 104 117 280 252 65 ...
##  $ Numerator_NonASD                      : int  45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
##  $ Proportion                            : num  0.00651 0.00649 0.00548 0.00989 0.00632 ...
##  $ X95_Z_CI                              : num  0.00074 0.000754 0.000986 0.001125 0.000991 ...
##  $ Z_Lower.CI                            : num  5.77 5.74 4.49 8.77 5.33 ...
##  $ Z_Upper.CI                            : num  7.25 7.25 6.47 11.02 7.31 ...
##  $ Z_Lower.CI_ABSerror                   : num  0.0314 0.062 0.1059 0.1311 0.0739 ...
##  $ Z_Upper.CI_ABSerror                   : num  0.0507 0.0542 0.1337 0.0803 0.0911 ...
##  $ Chi_Wilson_P                          : num  0.00655 0.00654 0.00557 0.00996 0.00639 ...
##  $ X95_Chi_Wilson_CI                     : num  0.000741 0.000755 0.00099 0.001127 0.000994 ...
##  $ Chi_Wilson_Lower.CI                   : num  5.81 5.78 4.58 8.83 5.4 ...
##  $ Chi_Wilson_Upper.CI                   : num  7.29 7.29 6.56 11.08 7.39 ...
##  $ Chi_Wilson_Lower.CI_ABSerror          : num  0.009314 0.019761 0.021503 0.069416 0.000453 ...
##  $ Chi_Wilson_Upper.CI_ABSerror          : num  0.0077 0.00953 0.04165 0.01523 0.01087 ...
##  $ Chi_Wilson_Corrected_w_minus.CI       : num  0.0058 0.00577 0.00456 0.00881 0.00538 ...
##  $ Chi_Wilson_Corrected_w_plus.CI        : num  0.0073 0.0073 0.00658 0.0111 0.00741 ...
##  $ Chi_Wilson_Corrected_Lower.CI         : num  5.8 5.77 4.56 8.81 5.38 ...
##  $ Chi_Wilson_Corrected_Upper.CI         : num  7.3 7.3 6.58 11.1 7.41 ...
##  $ Chi_Wilson_Corrected_Lower.CI_ABSerror: num  0.00109 0.03057 0.04265 0.08529 0.01834 ...
##  $ Chi_Wilson_Corrected_Upper.CI_ABSerror: num  0.00395 0.0026 0.01636 0.00254 0.01108 ...
##  $ Male.Prevalence                       : num  9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
##  $ Male.Lower.CI                         : num  8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
##  $ Male.Upper.CI                         : num  11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
##  $ Female.Prevalence                     : num  3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
##  $ Female.Lower.CI                       : num  2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
##  $ Female.Upper.CI                       : num  4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
##  $ Non.hispanic.white.Prevalence         : num  8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
##  $ Non.hispanic.white.Lower.CI           : num  7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
##  $ Non.hispanic.white.Upper.CI           : num  9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
##  $ Non.hispanic.black.Prevalence         : chr  "7.3" "5.3" "6.1" "10.6" ...
##  $ Non.hispanic.black.Lower.CI           : chr  "4.4" "4.4" "4.7" "8.5" ...
##  $ Non.hispanic.black.Upper.CI           : chr  "12.2" "6.4" "8" "13.1" ...
##  $ Hispanic.Prevalence                   : chr  "No data" "No data" "No data" "No data" ...
##  $ Hispanic.Lower.CI                     : chr  "No data" "No data" "No data" "No data" ...
##  $ Hispanic.Upper.CI                     : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Prevalence  : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Lower.CI    : chr  "No data" "No data" "No data" "No data" ...
##  $ Asian.or.Pacific.Islander.Upper.CI    : chr  "No data" "No data" "No data" "No data" ...
##  $ State_Region                          : chr  "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US State-Level Data</span> Pre-Process data
</h3>

Pre-Process data: Missing data

# Load required function from packages:
if(!require(naniar)){install.packages("naniar")}
## Loading required package: naniar
library(naniar)
if(!require(dplyr)){install.packages("dplyr")}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dplyr)
# Count missing values in dataframe:
sum(is.na(ASD_State)) # missing data recognised by R (NA)
## [1] 14454
# Define several offending strings
na_strings <- c("", "No data", "NA", "N A", "N / A", "N/A", "N/ A", "Not Available", "NOt available")
# Replace these defined missing values with R's internal NA
ASD_State = replace_with_na_all(ASD_State, condition = ~.x %in% na_strings)
# Count missing values in dataframe:
sum(is.na(ASD_State))
## [1] 28992
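
The jump in the NA count above comes from sentinel strings such as "No data" being recognised as missing. As a minimal sketch of the same idea without `naniar`, the base R version below works on a made-up data frame (`toy` and its values are assumptions) and touches character columns only.

```r
# Made-up data frame mixing real NA and sentinel strings (illustrative only)
toy <- data.frame(
  State = c("AZ", "GA", "MD"),
  Hispanic.Prevalence = c("No data", "7.1", "N/A"),
  Prevalence = c(6.5, NA, 5.5),
  stringsAsFactors = FALSE
)
na_strings <- c("", "No data", "NA", "N/A")

n_before <- sum(is.na(toy))   # counts only the NA that R already recognises

# Replace sentinel strings with NA in character columns
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], function(col) {
  col[col %in% na_strings] <- NA
  col
})

n_after <- sum(is.na(toy))    # now includes the former sentinel strings
c(before = n_before, after = n_after)

# Once the sentinels are gone, the column converts to numeric without warnings
toy$Hispanic.Prevalence <- as.numeric(toy$Hispanic.Prevalence)
str(toy)
```

When the sentinel strings are known before loading, `read.csv()` also accepts an `na.strings` argument, which handles the replacement at import time.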

Remove invalid Unicode char/string: \x92

# Remove invalid unicode char/string: \x92
ASD_State$Source_Full1[ASD_State$Source_Full1 == "National Survey of Children\x92s Health"] <- "National Survey of Children's Health"

Delete/Drop variables by index: columns 14 to 26, 29, and 30

cbind(names(ASD_State), c(1:length(names(ASD_State))))
##       [,1]                                     [,2]
##  [1,] "State"                                  "1" 
##  [2,] "Denominator"                            "2" 
##  [3,] "Prevalence"                             "3" 
##  [4,] "Lower.CI"                               "4" 
##  [5,] "Upper.CI"                               "5" 
##  [6,] "Year"                                   "6" 
##  [7,] "Source"                                 "7" 
##  [8,] "Source_Full1"                           "8" 
##  [9,] "State_Full1"                            "9" 
## [10,] "State_Full2"                            "10"
## [11,] "Numerator_ASD"                          "11"
## [12,] "Numerator_NonASD"                       "12"
## [13,] "Proportion"                             "13"
## [14,] "X95_Z_CI"                               "14"
## [15,] "Z_Lower.CI"                             "15"
## [16,] "Z_Upper.CI"                             "16"
## [17,] "Z_Lower.CI_ABSerror"                    "17"
## [18,] "Z_Upper.CI_ABSerror"                    "18"
## [19,] "Chi_Wilson_P"                           "19"
## [20,] "X95_Chi_Wilson_CI"                      "20"
## [21,] "Chi_Wilson_Lower.CI"                    "21"
## [22,] "Chi_Wilson_Upper.CI"                    "22"
## [23,] "Chi_Wilson_Lower.CI_ABSerror"           "23"
## [24,] "Chi_Wilson_Upper.CI_ABSerror"           "24"
## [25,] "Chi_Wilson_Corrected_w_minus.CI"        "25"
## [26,] "Chi_Wilson_Corrected_w_plus.CI"         "26"
## [27,] "Chi_Wilson_Corrected_Lower.CI"          "27"
## [28,] "Chi_Wilson_Corrected_Upper.CI"          "28"
## [29,] "Chi_Wilson_Corrected_Lower.CI_ABSerror" "29"
## [30,] "Chi_Wilson_Corrected_Upper.CI_ABSerror" "30"
## [31,] "Male.Prevalence"                        "31"
## [32,] "Male.Lower.CI"                          "32"
## [33,] "Male.Upper.CI"                          "33"
## [34,] "Female.Prevalence"                      "34"
## [35,] "Female.Lower.CI"                        "35"
## [36,] "Female.Upper.CI"                        "36"
## [37,] "Non.hispanic.white.Prevalence"          "37"
## [38,] "Non.hispanic.white.Lower.CI"            "38"
## [39,] "Non.hispanic.white.Upper.CI"            "39"
## [40,] "Non.hispanic.black.Prevalence"          "40"
## [41,] "Non.hispanic.black.Lower.CI"            "41"
## [42,] "Non.hispanic.black.Upper.CI"            "42"
## [43,] "Hispanic.Prevalence"                    "43"
## [44,] "Hispanic.Lower.CI"                      "44"
## [45,] "Hispanic.Upper.CI"                      "45"
## [46,] "Asian.or.Pacific.Islander.Prevalence"   "46"
## [47,] "Asian.or.Pacific.Islander.Lower.CI"     "47"
## [48,] "Asian.or.Pacific.Islander.Upper.CI"     "48"
## [49,] "State_Region"                           "49"
# Delete/Drop variables by index: columns 14 to 26, 29, and 30
# names(ASD_State)
ASD_State <- ASD_State[ -c(14:26, 29, 30) ]
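
Dropping by position works as long as the column order never changes. A hedged alternative is to drop by name or by pattern. The sketch below uses a made-up one-row data frame (`toy` is an assumption, with column names borrowed from the listing above).

```r
# Made-up data frame with a few of the column names listed above
toy <- data.frame(
  State = "AZ",
  Prevalence = 6.5,
  X95_Z_CI = 0.00074,
  Z_Lower.CI = 5.77,
  Z_Upper.CI = 7.25
)

# Option 1: drop an explicit set of names
drop_cols <- c("X95_Z_CI", "Z_Lower.CI", "Z_Upper.CI")
toy_by_name <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Option 2: drop every column whose name matches a pattern
toy_by_pattern <- toy[, !grepl("^(X95_)?Z_", names(toy)), drop = FALSE]

names(toy_by_name)
identical(toy_by_name, toy_by_pattern)
```

Name-based selection keeps working if columns are added or reordered upstream, at the cost of a longer statement than `-c(14:26, 29, 30)`.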

Create new variables

# Create one new variable: Source_UC as uppercase of Source
ASD_State$Source_UC <- paste(toupper(ASD_State$Source))
# Create one new variable: Source_Full3 by combining Source_UC and Source_Full1
ASD_State$Source_Full3 <- paste(ASD_State$Source_UC, ASD_State$Source_Full1)

Create one new ordinal categorical variable: Prevalence_Risk2 (“Low”, “High”) by binning Prevalence

# Recode Risk into category from Prevalence

# Low [0, 5)
# High [5, +oo) 

ASD_State$Prevalence_Risk2[ASD_State$Prevalence < 5] = "Low"
## Warning: Unknown or uninitialised column: 'Prevalence_Risk2'.
ASD_State$Prevalence_Risk2[ASD_State$Prevalence >= 5 ] = "High"
#
# head(ASD_State)

Create one new ordinal categorical variable: Prevalence_Risk4 (“Low”, “Medium”, “High”, “Very High”) by binning Prevalence

# Recode Risk into category from Prevalence

# Low [0, 5)
# Medium [5, 10)
# High [10, 20)
# Very High [20, +oo) 

ASD_State$Prevalence_Risk4 = "Very High"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 20 ] = "High"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 10 ] = "Medium"
ASD_State$Prevalence_Risk4[ASD_State$Prevalence < 5] = "Low"
#
# head(ASD_State)
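
The assignments above produce character labels. As a sketch of an alternative, `cut()` can create the same bins as an ordered factor in one call. The `prevalence` vector below is made up to sit on the bin edges (an assumption for illustration).

```r
# Made-up values chosen to sit on the bin boundaries, plus one missing value
prevalence <- c(3.3, 5.0, 9.9, 10.0, 19.9, 20.0, NA)

risk4 <- cut(prevalence,
             breaks = c(0, 5, 10, 20, Inf),
             labels = c("Low", "Medium", "High", "Very High"),
             right = FALSE,          # left-closed bins: [0, 5), [5, 10), [10, 20), [20, Inf)
             ordered_result = TRUE)  # ordered factor: Low < Medium < High < Very High

risk4
table(risk4, useNA = "ifany")
```

One behavioural difference is worth noting: with `cut()` a missing Prevalence stays `NA`, whereas in the assignment-based version any row with a missing Prevalence would keep the initial "Very High" label, because comparisons with `NA` select nothing.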

Convert to correct data types

str(ASD_State)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1692 obs. of  38 variables:
##  $ State                               : chr  "AZ" "GA" "MD" "NJ" ...
##  $ Denominator                         : int  45322 43593 21532 29714 24535 23065 35472 45113 36472 11020 ...
##  $ Prevalence                          : num  6.5 6.5 5.5 9.9 6.3 4.5 3.3 6.2 6.9 5.9 ...
##  $ Lower.CI                            : num  5.8 5.8 4.6 8.9 5.4 3.7 2.7 5.5 6.1 4.6 ...
##  $ Upper.CI                            : num  7.3 7.3 6.6 11.1 7.4 5.5 3.9 7 7.8 7.5 ...
##  $ Year                                : int  2000 2000 2000 2000 2000 2000 2002 2002 2002 2002 ...
##  $ Source                              : chr  "addm" "addm" "addm" "addm" ...
##  $ Source_Full1                        : chr  "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" "Autism & Developmental Disabilities Monitoring Network" ...
##  $ State_Full1                         : chr  "Arizona" "Georgia" "Maryland" "New Jersey" ...
##  $ State_Full2                         : chr  "AZ-Arizona" "GA-Georgia" "MD-Maryland" "NJ-New Jersey" ...
##  $ Numerator_ASD                       : int  295 283 118 294 155 104 117 280 252 65 ...
##  $ Numerator_NonASD                    : int  45027 43310 21414 29420 24380 22961 35355 44833 36220 10955 ...
##  $ Proportion                          : num  0.00651 0.00649 0.00548 0.00989 0.00632 ...
##  $ Chi_Wilson_Corrected_Lower.CI       : num  5.8 5.77 4.56 8.81 5.38 ...
##  $ Chi_Wilson_Corrected_Upper.CI       : num  7.3 7.3 6.58 11.1 7.41 ...
##  $ Male.Prevalence                     : num  9.7 11 8.6 14.8 9.3 6.6 5 10.1 10.7 9.9 ...
##  $ Male.Lower.CI                       : num  8.5 9.7 7.1 13 7.8 5.2 4.1 8.8 9.3 7.6 ...
##  $ Male.Upper.CI                       : num  11.1 12.4 10.6 16.8 11.2 8.2 6.2 11.4 12.3 12.9 ...
##  $ Female.Prevalence                   : num  3.2 2 2.2 4.3 3.3 2.4 1.4 2.2 2.9 1.7 ...
##  $ Female.Lower.CI                     : num  2.5 1.5 1.5 3.3 2.4 1.6 0.9 1.7 2.2 0.9 ...
##  $ Female.Upper.CI                     : num  4 2.7 2.7 5.5 4.5 3.5 2.1 2.9 3.8 3.2 ...
##  $ Non.hispanic.white.Prevalence       : num  8.6 7.9 4.9 11.3 6.5 4.5 3.3 7.7 7.4 6.4 ...
##  $ Non.hispanic.white.Lower.CI         : num  7.5 6.7 3.8 9.5 5.2 3.7 2.6 6.7 6.5 4.8 ...
##  $ Non.hispanic.white.Upper.CI         : num  9.8 9.3 6.4 13.3 8.2 5.5 4.1 8.9 8.6 8.5 ...
##  $ Non.hispanic.black.Prevalence       : chr  "7.3" "5.3" "6.1" "10.6" ...
##  $ Non.hispanic.black.Lower.CI         : chr  "4.4" "4.4" "4.7" "8.5" ...
##  $ Non.hispanic.black.Upper.CI         : chr  "12.2" "6.4" "8" "13.1" ...
##  $ Hispanic.Prevalence                 : chr  NA NA NA NA ...
##  $ Hispanic.Lower.CI                   : chr  NA NA NA NA ...
##  $ Hispanic.Upper.CI                   : chr  NA NA NA NA ...
##  $ Asian.or.Pacific.Islander.Prevalence: chr  NA NA NA NA ...
##  $ Asian.or.Pacific.Islander.Lower.CI  : chr  NA NA NA NA ...
##  $ Asian.or.Pacific.Islander.Upper.CI  : chr  NA NA NA NA ...
##  $ State_Region                        : chr  "D8 Mountain" "D5 South Atlantic" "D5 South Atlantic" "D2 Middle Atlantic" ...
##  $ Source_UC                           : chr  "ADDM" "ADDM" "ADDM" "ADDM" ...
##  $ Source_Full3                        : chr  "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" "ADDM Autism & Developmental Disabilities Monitoring Network" ...
##  $ Prevalence_Risk2                    : chr  "High" "High" "High" "High" ...
##  $ Prevalence_Risk4                    : chr  "Medium" "Medium" "Medium" "Medium" ...
# cbind(names(ASD_State), c(1:length(names(ASD_State))))

Convert variables to numeric

# Convert Prevalence and CIs from categorical/chr to numeric
ix <- 13:33 # define an index
ASD_State[ix] <- lapply(ASD_State[ix], as.numeric)
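
The numeric index `13:33` relies on the column order shown by `str()` above. A possible name-based variant is sketched below on a toy data frame. The regular expression is an assumption about the naming pattern (columns ending in `Prevalence` or `.CI`) and would need adjusting to cover columns such as `Proportion`:

```r
# Toy data frame mimicking a few chr columns; the values are illustrative only
df <- data.frame(State = c("AZ", "GA"),
                 Male.Prevalence = c("9.7", "11"),
                 Male.Lower.CI = c("8.5", "9.7"),
                 Hispanic.Prevalence = c(NA, "5.1"),
                 stringsAsFactors = FALSE)

# Pick columns by name pattern instead of by position (pattern is an assumption)
num_cols <- grep("Prevalence$|\\.CI$", names(df), value = TRUE)

# Convert the selected columns to numeric; NA stays NA
df[num_cols] <- lapply(df[num_cols], as.numeric)

sapply(df, class)
```

Note that `as.numeric()` turns any text that is not a number into `NA` with a warning, so it is worth comparing the count of missing values before and after the conversion.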

Convert variables to categorical/factor

# Convert State, Source and the related label/region columns from chr to factor
ix <- c(1, 7, 8, 9, 10, 34, 35, 36) # define an index
ASD_State[ix] <- lapply(ASD_State[ix], as.factor)

# Create new ordered factor Year_Factor from Year
ASD_State$Year_Factor <- factor(ASD_State$Year, ordered = TRUE)

Convert Prevalence_Risk2 & Prevalence_Risk4 to ordered factor

# Convert to factor
ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE,
                                           levels=c("Low", "High"))
# Convert to factor
ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE,
                                           levels=c("Low", "Medium", "High", "Very High"))
# Display unique values (levels) of each categorical (factor) variable
lapply(select_if(ASD_State, is.factor), levels)
## $State
##  [1] "AK" "AL" "AR" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "GA" "HI" "IA" "ID" "IL"
## [16] "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MI" "MN" "MO" "MS" "MT" "NC" "ND" "NE"
## [31] "NH" "NJ" "NM" "NV" "NY" "OH" "OK" "OR" "PA" "RI" "SC" "SD" "TN" "TX" "UT"
## [46] "VA" "VT" "WA" "WI" "WV" "WY"
## 
## $Source
## [1] "addm" "medi" "nsch" "sped"
## 
## $Source_Full1
## [1] "Autism & Developmental Disabilities Monitoring Network"
## [2] "Medicaid"                                              
## [3] "National Survey of Children's Health"                  
## [4] "Special Education Child Count"                         
## 
## $State_Full1
##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"             
## 
## $State_Full2
##  [1] "AK-Alaska"               "AL-Alabama"             
##  [3] "AR-Arkansas"             "AZ-Arizona"             
##  [5] "CA-California"           "CO-Colorado"            
##  [7] "CT-Connecticut"          "DC-District of Columbia"
##  [9] "DE-Delaware"             "FL-Florida"             
## [11] "GA-Georgia"              "HI-Hawaii"              
## [13] "IA-Iowa"                 "ID-Idaho"               
## [15] "IL-Illinois"             "IN-Indiana"             
## [17] "KS-Kansas"               "KY-Kentucky"            
## [19] "LA-Louisiana"            "MA-Massachusetts"       
## [21] "MD-Maryland"             "ME-Maine"               
## [23] "MI-Michigan"             "MN-Minnesota"           
## [25] "MO-Missouri"             "MS-Mississippi"         
## [27] "MT-Montana"              "NC-North Carolina"      
## [29] "ND-North Dakota"         "NE-Nebraska"            
## [31] "NH-New Hampshire"        "NJ-New Jersey"          
## [33] "NM-New Mexico"           "NV-Nevada"              
## [35] "NY-New York"             "OH-Ohio"                
## [37] "OK-Oklahoma"             "OR-Oregon"              
## [39] "PA-Pennsylvania"         "RI-Rhode Island"        
## [41] "SC-South Carolina"       "SD-South Dakota"        
## [43] "TN-Tennessee"            "TX-Texas"               
## [45] "UT-Utah"                 "VA-Virginia"            
## [47] "VT-Vermont"              "WA-Washington"          
## [49] "WI-Wisconsin"            "WV-West Virginia"       
## [51] "WY-Wyoming"             
## 
## $State_Region
## [1] "D1 New England"        "D2 Middle Atlantic"    "D3 East North Central"
## [4] "D4 West North Central" "D5 South Atlantic"     "D6 East South Central"
## [7] "D7 West South Central" "D8 Mountain"           "D9 Pacific"           
## 
## $Source_UC
## [1] "ADDM" "MEDI" "NSCH" "SPED"
## 
## $Source_Full3
## [1] "ADDM Autism & Developmental Disabilities Monitoring Network"
## [2] "MEDI Medicaid"                                              
## [3] "NSCH National Survey of Children's Health"                  
## [4] "SPED Special Education Child Count"                         
## 
## $Prevalence_Risk2
## [1] "Low"  "High"
## 
## $Prevalence_Risk4
## [1] "Low"       "Medium"    "High"      "Very High"
## 
## $Year_Factor
##  [1] "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008" "2009"
## [11] "2010" "2011" "2012" "2013" "2014" "2015" "2016"
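
Declaring the risk variables as ordered factors is what makes comparisons such as "High or above" meaningful. A small sketch on a made-up vector (not the ASD data):

```r
# Hypothetical risk labels, using the same level order as Prevalence_Risk4
risk <- factor(c("High", "Low", "Very High", "Medium"),
               levels = c("Low", "Medium", "High", "Very High"),
               ordered = TRUE)

# Comparison operators follow the level order, not the alphabet
at_least_high <- risk >= "High"
at_least_high

# max() and sort() also respect the level order
highest <- max(risk)
sort(risk)
```

With an unordered factor, `>=` returns `NA` with a warning, which is why the conversion above passes `ordered = TRUE`.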

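Optionally, export the processed dataframe to a CSV file. CSV does not store factor ordering, so the ordered factors must be re-created after reading the file back in.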

write.csv(ASD_State, file = "../dataset/ADV_ASD_State_R.csv", row.names = FALSE)
# Read back in above saved file:
# ASD_State <- read.csv("../dataset/ADV_ASD_State_R.csv")
# ASD_State$Year_Factor <- factor(ASD_State$Year_Factor, ordered = TRUE) # Convert Year_Factor to ordered.factor
# ASD_State$Prevalence_Risk2 = factor(ASD_State$Prevalence_Risk2, ordered=TRUE, levels=c("Low", "High"))
# ASD_State$Prevalence_Risk4 = factor(ASD_State$Prevalence_Risk4, ordered=TRUE, levels=c("Low", "Medium", "High", "Very High"))
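
The reason the factors are rebuilt after `read.csv()` can be seen in a short round trip. The sketch below uses a toy data frame and temporary files, and shows `saveRDS()`/`readRDS()` as one possible alternative that keeps R-specific types:

```r
# Toy data frame with one ordered factor (illustrative values)
df <- data.frame(Risk = factor(c("Low", "High", "Low"),
                               levels = c("Low", "High"),
                               ordered = TRUE))

# CSV round trip: the ordering is not stored in the file
csv_file <- tempfile(fileext = ".csv")
write.csv(df, csv_file, row.names = FALSE)
df_csv <- read.csv(csv_file)
csv_is_ordered <- is.ordered(df_csv$Risk)   # FALSE: column comes back unordered

# Re-create the ordered factor explicitly
df_csv$Risk <- factor(df_csv$Risk, levels = c("Low", "High"), ordered = TRUE)

# RDS round trip: the R object is restored as it was saved
rds_file <- tempfile(fileext = ".rds")
saveRDS(df, rds_file)
df_rds <- readRDS(rds_file)
identical(df, df_rds)
```

CSV remains the better choice when the file has to be opened outside R; RDS is convenient when the data stays within R sessions.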
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">US. State Level Data Visualisation</span>
</h3>

<h3>
<span style="color:blue">The chart above shows availability at data source level; we'd also like to know State level data availability. How?</span>
</h3>
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span> [ Years Data Available by State ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=12)
# ----------------------------------
# [State] < Years Data Available by State >
# ----------------------------------
p <- ggplot(ASD_State, aes(x = Source, fill = Source)) + 
  geom_bar() + theme(axis.text.x=element_blank(),  # Hide axis
                     axis.ticks.x=element_blank(), # Hide axis
                     axis.text.y=element_blank(),  # Hide axis
                     axis.ticks.y=element_blank(), # Hide axis
                     panel.background = element_blank(), # Remove panel background
                     legend.position="top",
                     strip.text.y = element_text(angle=0) # Rotate text to horizontal
  ) + 
  scale_fill_manual("Data Source:", values = c("addm" = "darkblue", 
                                               "medi" = "orange", 
                                               "nsch" = "darkred",
                                               "sped" = "skyblue")) +
  facet_grid(facets = State_Full2 ~ Year) +
  labs(x="", y="", title="Years Data Available by State") # layers of graphics
# The plot below may take a while to render
# Show plot
p
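
The facet grid answers the availability question visually. A tabular check is sometimes quicker to read. The sketch below cross-tabulates State against Year on a small made-up data frame (the rows are hypothetical, not the ASD data):

```r
# Hypothetical observations: one row per State/Year record
obs <- data.frame(State = c("AZ", "AZ", "GA", "GA", "GA", "NJ"),
                  Year  = c(2000, 2002, 2000, 2002, 2004, 2000))

# Count records for every State x Year combination (0 = no data)
avail <- table(obs$State, obs$Year)
avail

# Number of distinct years with data, per state
years_per_state <- rowSums(avail > 0)
years_per_state
```

On the full dataset the same idea could be applied per data source, for example by tabulating a subset of one source at a time.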

Filter and create dataframe of different data sources, for easy data access

# Filter and create dataframe of different data sources, for easy data access
ASD_State_ADDM <- subset(ASD_State, Source == 'addm')
ASD_State_MEDI <- subset(ASD_State, Source == 'medi')
ASD_State_NSCH <- subset(ASD_State, Source == 'nsch')
ASD_State_SPED <- subset(ASD_State, Source == 'sped')
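
As an alternative to four separate `subset()` calls, base R's `split()` returns one data frame per source in a named list. A minimal sketch on toy data (rows are made up for illustration):

```r
# Hypothetical records from two sources
df <- data.frame(State  = c("AZ", "GA", "AZ", "GA"),
                 Source = c("addm", "addm", "medi", "medi"),
                 Prevalence = c(6.5, 6.5, 3.1, 2.9))

# One data frame per source, named after the source value
by_source <- split(df, df$Source)

names(by_source)
by_source$addm
```

The list form is handy when the same plot has to be repeated for every source, since it can be looped over with `lapply()`.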
<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] Explore the Data</span> Years Data Available by State [ Source: ADDM ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)

Years Data Available by State [ Source: ADDM ]

# Years Data Available by State [ Source: ADDM ]
p <- ggplot(ASD_State_ADDM, aes(x = 1, fill = State_Full2)) + 
  geom_bar() + theme(axis.text.x=element_blank(),  # Hide axis
                     axis.ticks.x=element_blank(), # Hide axis
                     axis.text.y=element_blank(),  # Hide axis
                     axis.ticks.y=element_blank(), # Hide axis
                     panel.background = element_blank(), # Remove panel background
                     legend.position="none",
                     strip.text.y = element_text(angle=0) # Rotate text to horizontal
  ) +
  facet_grid(facets = State_Full2 ~ Year_Factor) +
  labs(x="", y="", title="Years Data Available by State [ Source: ADDM ]") # layers of graphics
# Show plot
p

<h3>
    Quiz:
</h3>
<p>
    Create <span style="color:blue">Years Data Available by State [ Source: XXXX ]</span> for the other three data sources:
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION (States)</span> Prevalence Estimates by State [ Source: ADDM ]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)

Visualise: Prevalence Estimates by State [ Source: ADDM ]

# Prevalence Estimates by State [ Source: ADDM ] , aggregated for different years
p <- ggplot(ASD_State_ADDM, aes(x = reorder(State_Full2, Prevalence, FUN = median), # Order States by median of Prevalence  
                                y = Prevalence)) + 
  geom_boxplot(aes(fill = reorder(State_Full2, Prevalence, FUN = median))) + # fill color by State
  scale_fill_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
  #  geom_boxplot(fill = 'darkslategrey', alpha = 0.2) + 
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(0, 30, 5),
                     limits=c(0, 30)) +
  scale_x_discrete(name = "") +
  ggtitle("Prevalence Estimates by State [ Source: ADDM ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        legend.position = 'none') + 
  coord_flip() + # Rotate chart
  geom_jitter(alpha = 0.5, position = position_jitter(width = 0.1)) # Add actual data points
# Show plot
p
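
The ordering of states in the boxplot comes from `reorder()`, which sorts factor levels by a summary of a second variable. A small base R sketch with made-up values shows what it does on its own:

```r
# Hypothetical prevalence values, two per state
state <- c("AZ", "AZ", "NJ", "NJ", "GA", "GA")
prev  <- c(6.5, 14, 9.9, 24.6, 6.5, 17)

# Levels are sorted by the median of prev within each state
ordered_state <- reorder(state, prev, FUN = median)

levels(ordered_state)
tapply(prev, state, median)  # the medians that drive the ordering
```

ggplot2 draws discrete axes in level order, so reordering the levels is enough to reorder the boxes.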

# Theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
    Quiz:
</h3>
<p>
    Create <span style="color:blue">Prevalence Estimates by State [ Source: XXXX ]</span> for the other three data sources:
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> No. Children Surveyed by State [ Source: ADDM ] [Year 2014]
</h3>
# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=4)

Visualise: No. Children Surveyed by State [ Source: ADDM ] [Year 2014]

# All State Prevalence data with: Source == 'addm' & Year == 2014
# filter using dataframe: ASD_State_ADDM
ASD_State_Subset <- subset(ASD_State_ADDM, Year == 2014)
# or filter using dataframe: ASD_State
ASD_State_Subset <- subset(ASD_State, Source == 'addm' & Year == 2014)
# Bar plot/chart for < No. Children surveyed by State [ADDM] [Year 2014] >
p <- ggplot(ASD_State_Subset, aes(x = reorder(State_Full1, Denominator, FUN = median), # Order States by median of Denominator  
                                  y = Denominator)) + 
  geom_bar(stat="identity", aes(fill = reorder(State_Full1, Denominator, FUN = median))) + # fill color by State
  scale_fill_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
  scale_x_discrete(name = "US. States") +
  scale_y_continuous(name = "No. Children (Denominator)") +
  ggtitle("No. Children Surveyed by State [ Source: ADDM ] [Year 2014]") +
  #  geom_text(aes(label=Denominator), vjust=1.6, color="darkslategrey", size=3.5) + # Show data label inside bars
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        legend.position="none") 
# Show plot
p

# Theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + theme(legend.position = 'none')
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
    Quiz:
</h3>
<p>
    Create <span style="color:blue">No. Children Surveyed by State [ Source: XXXX ] [Year CCYY]</span> for other data sources & years:
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
    Quiz:
</h3>
<p>
    Create <span style="color:blue">No. ASD Children by State [ Source: XXXX ] [Year CCYY]</span> for other data sources & years:
</p>
<p>
    Hint: Use variable: ASD_State_ADDM$Numerator_ASD
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)

Visualise: Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]

# ASD_State_Subset <- subset(ASD_State_ADDM, Year == 2014)
# or
# ASD_State_Subset <- subset(ASD_State, Source == 'addm' & Year == 2014)

# Point plot/chart 
p = ggplot(ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), # Order States by median of Prevalence  
                                 y = Prevalence)) + 
  geom_point(stat="identity", aes(colour = reorder(State_Full1, Prevalence, median)), size = 10, alpha = 0.1, pch = 15) + # point colour by State
  scale_colour_discrete(guide = guide_legend(title = "US. States")) + # Legend Name
  scale_y_continuous(name = "Prevalence per 1,000 Children",
                     breaks = seq(10, 35, 5),
                     limits=c(10, 35)) +
  scale_x_discrete(name = "US. States") +
  ggtitle("Prevalence Estimates with 95% CI by State [ Source: ADDM ] [ Year 2014 ]") +
  theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
        axis.title = element_text(face = 'plain', color = "darkslategrey"),
        legend.position = 'none') +
  geom_text(aes(label=Prevalence), hjust=0.5, color="black", size=3.5)  # Show the Prevalence value as a label on each point
# Show plot
p

# Add Lower.CI
p = p + geom_point(data = ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), y = Lower.CI,
                                                shape=Source # point shape
), 
size = 2 # point size
) +
  #  geom_text(aes(label=Lower.CI), hjust=-0.1, vjust=3, color="darkslategrey", size=2.5) + # Show data label inside bars 
  scale_shape_manual(values=3)  # manual define point shape
# Show plot
p

# Add Upper.CI
p = p + geom_point(data = ASD_State_Subset, aes(x = reorder(State_Full1, Prevalence, median), y = Upper.CI, 
                                                shape=Source # point shape
), 
size = 2 # point size
) 
#  geom_text(aes(label=Upper.CI), hjust=-0.1, vjust=-3, color="darkslategrey", size=2.5) # Show data label inside bars 
# Show plot
p
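
The interval bounds plotted above come straight from the data. For readers wondering how such bounds can arise, the sketch below reproduces an interval from raw counts with `prop.test()`, which returns a Wilson score interval with continuity correction. That this matches the method behind the `Chi_Wilson_Corrected_*` columns is an assumption based on their names. The counts are those of the first row in the `str()` output earlier (AZ, 2000):

```r
# Counts from the first row shown in the str() output (AZ, year 2000)
numerator_asd <- 295
denominator   <- 45322

# Wilson score interval with continuity correction (correct = TRUE)
test <- prop.test(numerator_asd, denominator, conf.level = 0.95, correct = TRUE)

# Convert the proportion bounds to "per 1,000 children"
ci_per_1000 <- as.numeric(test$conf.int) * 1000
round(ci_per_1000, 1)

# Point estimate on the same scale
prevalence_per_1000 <- numerator_asd / denominator * 1000
round(prevalence_per_1000, 1)
```

The result comes out close to the 5.8 and 7.3 listed for that row, which supports the assumption but does not confirm how the original columns were computed.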

# theme of the economist magazine:
# p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States")) + theme(legend.position = 'none')
# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States")) + theme(legend.position = 'none')
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
    Quiz:
</h3>
<p>
    Create <span style="color:blue">Prevalence Estimates with 95% CI by State [ Source: XXXX ] [Year CCYY]</span> for other data sources & years:
</p>
# Write your code below and press Shift+Enter to execute 


<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)

Visualise: Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]

# All year/time Prevalence data with: Source_UC == 'ADDM' & State_Full2 == 'AZ-Arizona'
ASD_State_Subset <- subset(ASD_State, Source_UC == 'ADDM' & State_Full2 == 'AZ-Arizona')

# Line plot/chart for < State ASD Prevalence [ADDM] [AZ-Arizona] >
p <- ggplot(ASD_State_Subset, aes(x = Year, y = Prevalence))
# Select (add) line chart type:
p <- p + geom_line(aes(color = State_Full2),
                   linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
                   size=1,
                   alpha=0.5) 
# Select (add) points to chart:
p <- p + geom_point(aes(color = State_Full2),
                    size=3, 
                    shape=20,
                    alpha=0.5) 
# Customize legend name:
p <- p + labs(color = "US. State")
# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(0, 30, 5),
                            limits=c(0, 30)) +
  scale_x_continuous(name = "Year", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) 
# Customize chart title:
p <- p + ggtitle("Prevalence Estimates over Year [ Source: ADDM ] [ State: AZ-Arizona ]") 
# Customize chart title and axis labels:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
               axis.title = element_text(face = 'plain', color = "darkslategrey")) 
# Show plot
p

# Theme of the economist magazine:
p + theme_economist() + scale_colour_economist()

<h3>
Data Visualisation (Enhanced) - <span style="color:blue">[ R ] US. State Level</span> Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)

Visualise: Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]

p <- ggplot(ASD_State_ADDM, aes(x = Year, y = Prevalence))
# Select (add) line chart type:
p <- p + geom_line(aes(color = State_Full2),
                   linetype = "solid",  # http://sape.inf.usi.ch/quick-reference/ggplot2/linetype
                   size=1,
                   alpha=0.5) 
# Select (add) points to chart:
p <- p + geom_point(aes(color = State_Full2),
                    size=3, 
                    shape=20,
                    alpha=0.5) 
# Show plot
# p
# Customize line color and legend name:
p <- p + labs(color = "US. State")
# Adjust x and y axis, scale, limit and labels:
p <- p + scale_y_continuous(name = "Prevalence per 1,000 Children",
                            breaks = seq(0, 30, 5),
                            limits=c(0, 30)) +
  scale_x_continuous(name = "Year (2000 - 2016)", 
                     breaks = seq(2000, 2016, 1), 
                     limits = c(2000, 2016)) 
# Customize chart title:
p <- p + ggtitle("Prevalence Estimates over Year [ Source: ADDM ] [ State: ALL ]") 
# Customize chart title and axis labels:
p <- p + theme(title = element_text(face = 'bold.italic', color = "darkslategrey"), 
               axis.title = element_text(face = 'plain', color = "darkslategrey"),
               legend.position="right")
# Show plot
p

# Dynamic chart
p_dynamic <- p + theme_economist() + scale_colour_economist() + scale_colour_discrete(guide = guide_legend(title = "US. States"))
## Scale for 'colour' is already present. Adding another scale for 'colour',
## which will replace the existing scale.
p_dynamic <- ggplotly(p_dynamic)
p_dynamic

Split chart by state

# Show plot in facet_grid
p + facet_grid(facets = . ~ State) + 
  theme(legend.position = "none", # Hide legend
        axis.text.x=element_blank(),  # Hide axis
        axis.ticks.x=element_blank(), # Hide axis
        panel.background = element_blank(), # Remove panel background
        panel.grid.major = element_line(size = 0.1, linetype = 1, colour = "lightgrey")
  ) 
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?
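
The `geom_path` message appears when a line is requested for a group that contains a single point, which here presumably means states with only one year of ADDM data. A quick way to find such groups is sketched below on made-up rows (state codes and values are hypothetical):

```r
# Hypothetical records: AZ has three years, MN and TN only one each
obs <- data.frame(State = c("AZ", "AZ", "AZ", "MN", "TN"),
                  Year  = c(2000, 2002, 2004, 2014, 2014),
                  Prevalence = c(6.5, 6.2, 9.8, 24, 15.4))

# Count records per state and pick out the single-observation ones
n_per_state   <- table(obs$State)
single_states <- names(n_per_state[n_per_state == 1])
single_states

# Rows that can actually form a line (two or more points per state)
obs_lines <- obs[!(obs$State %in% single_states), ]
obs_lines
```

The points for single-year states are still drawn by `geom_point()`, so the message can usually be ignored; filtering is only needed if the line layer should be restricted to states with a trend.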

<h3>
Data Visualisation (Enhanced) - Plotting on Map
</h3>
# ----------------------------------
# EDA - Visualisation on map
# ----------------------------------
if(!require(usmap)){install.packages("usmap")}
## Loading required package: usmap
library(usmap) # usmap: Mapping the US
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ CDC ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span>
</h3>

<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span>
</h3>

Let’s review data availability by data Sources & Years:

  • ASD_State_ADDM in Years: 2000, 2002, 2004, 2006, 2008, 2010, 2012, 2014

  • ASD_State_MEDI in Years: 2000 ~ 2012

  • ASD_State_NSCH in Years: 2004, 2008, 2012, 2016

  • ASD_State_SPED in Years: 2000 ~ 2016
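
A list like the one above can be derived from the data rather than typed by hand. The sketch below does this on a small made-up data frame. Applied to `ASD_State`, the same two lines would use its `Source` and `Year` columns:

```r
# Hypothetical records; repeated years stand for several states in the same year
obs <- data.frame(Source = c("addm", "addm", "nsch", "nsch", "nsch"),
                  Year   = c(2000L, 2002L, 2004L, 2008L, 2004L))

# Sorted unique years for each data source
years_by_source <- lapply(split(obs$Year, obs$Source),
                          function(y) sort(unique(y)))
years_by_source
```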

<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span> [ Source: ADDM ] [ Year: 2014 ]
</h3>
# Adjust in-line plot size to M x N
# options(repr.plot.width=8, repr.plot.height=4)

Prepare US State level data: [ Source: ADDM ] [ Year: 2014 ]

# Prepare data - addm 2014
Map_Data_Source = 'addm' # Available values lowercase: 'addm', 'medi', 'nsch', 'sped'.
Map_Data_Value = 'Prevalence' # variable must be numeric, variable name in 'quotation'. Or else Error: Discrete value supplied to continuous scale

# Uncomment below to use Prevalence of different groups:
# Map_Data_Value = 'Male.Prevalence' # variable must be numeric, variable name in 'quotation'. Or else Error: Discrete value supplied to continuous scale
# Map_Data_Value = 'Female.Prevalence' # variable must be numeric, variable name in 'quotation'. Or else Error: Discrete value supplied to continuous scale
# Map_Data_Value = 'Asian.or.Pacific.Islander.Prevalence' # variable must be numeric, variable name in 'quotation'. Or else Error: Discrete value supplied to continuous scale

Map_Data_Year = 2014 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)

The usmap plotting function requires the input data to have a column named state or fips (the column names are case sensitive):

  • state: Name of US state

  • fips: FIPS code for a US state or county

https://cran.r-project.org/web/packages/usmap/vignettes/mapping.html

https://cran.r-project.org/web/packages/usmap/usmap.pdf

# The usmap package/function requires input data to have a column of 'state', or 'fips'. (case sensitive)
ASD_State_Subset$state = ASD_State_Subset$State 
# Glance
head(ASD_State_Subset)
## # A tibble: 6 x 40
##   State Denominator Prevalence Lower.CI Upper.CI  Year Source Source_Full1
##   <fct>       <int>      <dbl>    <dbl>    <dbl> <int> <fct>  <fct>       
## 1 AZ          24952       14       12.6     15.5  2014 addm   Autism & De…
## 2 AR          39992       13.1     12       14.2  2014 addm   Autism & De…
## 3 CO          41128       13.9     12.8     15.1  2014 addm   Autism & De…
## 4 GA          51161       17       15.9     18.1  2014 addm   Autism & De…
## 5 MD           9955       20       17.4     22.9  2014 addm   Autism & De…
## 6 MN           9767       24       21.1     27.2  2014 addm   Autism & De…
## # … with 32 more variables: State_Full1 <fct>, State_Full2 <fct>,
## #   Numerator_ASD <int>, Numerator_NonASD <int>, Proportion <dbl>,
## #   Chi_Wilson_Corrected_Lower.CI <dbl>, Chi_Wilson_Corrected_Upper.CI <dbl>,
## #   Male.Prevalence <dbl>, Male.Lower.CI <dbl>, Male.Upper.CI <dbl>,
## #   Female.Prevalence <dbl>, Female.Lower.CI <dbl>, Female.Upper.CI <dbl>,
## #   Non.hispanic.white.Prevalence <dbl>, Non.hispanic.white.Lower.CI <dbl>,
## #   Non.hispanic.white.Upper.CI <dbl>, Non.hispanic.black.Prevalence <dbl>,
## #   Non.hispanic.black.Lower.CI <dbl>, Non.hispanic.black.Upper.CI <dbl>,
## #   Hispanic.Prevalence <dbl>, Hispanic.Lower.CI <dbl>,
## #   Hispanic.Upper.CI <dbl>, Asian.or.Pacific.Islander.Prevalence <dbl>,
## #   Asian.or.Pacific.Islander.Lower.CI <dbl>,
## #   Asian.or.Pacific.Islander.Upper.CI <dbl>, State_Region <fct>,
## #   Source_UC <fct>, Source_Full3 <fct>, Prevalence_Risk2 <ord>,
## #   Prevalence_Risk4 <ord>, Year_Factor <ord>, state <fct>
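
The filter-then-rename steps are repeated for every map that follows, so they could be wrapped in a small helper. `prepare_map_data()` below is a hypothetical name introduced for this sketch, not a usmap function. The toy rows reuse the AZ and GA 2014 ADDM values shown above, plus one made-up NSCH row:

```r
# Hypothetical helper: filter by source and year, then add the lowercase
# 'state' column that the map function looks for
prepare_map_data <- function(data, source, year) {
  keep <- which(data$Source == source & data$Year == year)  # which() drops NA matches
  out <- data[keep, , drop = FALSE]
  out$state <- out$State
  out
}

# Toy data: two rows from the output above, one made-up NSCH row
toy <- data.frame(State  = c("AZ", "GA", "AZ"),
                  Source = c("addm", "addm", "nsch"),
                  Year   = c(2014L, 2014L, 2016L),
                  Prevalence = c(14, 17, 25))

map_data <- prepare_map_data(toy, source = "addm", year = 2014)
map_data
```

Using `[` with `which()` instead of `subset()` avoids name clashes between the function arguments and the data frame columns.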

Visualise: Prevalence Estimates by Geographic Area [ Source: ADDM ] [ Year: 2014 ]

# Show data on map
p_map_addm_2014 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value, 
                              color = "white", # map line colour
                              labels = TRUE,  # State name shown
                              label_color = 'white' # State name colour
) + 
  scale_fill_continuous(
    na.value = "lightgrey", # Set colour with no State data
    low="lightblue1", high = "darkblue", 
    name = "Prevalence\nper 1,000\nChildren", 
    limits=c(0, 40) #same colour levels/limits for plots
  ) +
  labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
       subtitle = 'https://www.cdc.gov/ncbddd/autism'
  ) + 
  theme(panel.background = element_rect(color = "white", fill = "white"),
        legend.position = "right")
# Show map
p_map_addm_2014

# Dynamic map
p_dynamic <- p_map_addm_2014
p_dynamic <- ggplotly(p_dynamic)
p_dynamic
<h3>
Data Visualisation (Enhanced) - Plotting on Map <span style="color:blue">[ R ] REPORTED PREVALENCE VARIES BY GEOGRAPHIC LOCATION</span> [ Source: NSCH] [ Year: 2004, 2008, 2012, 2016 ]
</h3>

Prepare US State level data: [ Source: NSCH ] [ Year: ALL ]

Map_Data_Source = 'nsch' # Available values lowercase: 'addm', 'medi', 'nsch', 'sped'.
Map_Data_Value = 'Prevalence' # variable must be numeric, variable name in 'quotation'. Or else Error: Discrete value supplied to continuous scale

Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2004 ]

# Prepare data - nsch 2004
Map_Data_Year = 2004 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
# Plot on map
p_map_nsch_2004 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value, color = "white", labels = FALSE, label_color = 'white') +
  scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue", name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
  labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
       subtitle = 'https://www.cdc.gov/ncbddd/autism') +
  theme(panel.background = element_rect(color = "white", fill = "white"), legend.position = "right")
p_map_nsch_2004

Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2008 ]

# Prepare data - nsch 2008
Map_Data_Year = 2008 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
p_map_nsch_2008 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value, color = "white", labels = FALSE, label_color = 'white') +
  scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue", name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
  labs(title = paste("Prevalence Estimates by Geographic Area", '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source, "] [ Year :", Map_Data_Year, "]"),
       subtitle = 'https://www.cdc.gov/ncbddd/autism') +
  theme(panel.background = element_rect(color = "white", fill = "white"), legend.position = "right")
p_map_nsch_2008

Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2012 ]

# Prepare data - nsch 2012
Map_Data_Year = 2012 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
# Plot on map
p_map_nsch_2012 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
                              color = "white", labels = FALSE, label_color = 'white') +
  scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
                        name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
  labs(title = paste("Prevalence Estimates by Geographic Area",
                     '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source,
                     "] [ Year :", Map_Data_Year, "]"),
       subtitle = 'https://www.cdc.gov/ncbddd/autism') +
  theme(panel.background = element_rect(color = "white", fill = "white"),
        legend.position = "right")
p_map_nsch_2012

Visualise: Prevalence Estimates by Geographic Area [ Source: NSCH ] [ Year: 2016 ]

# Prepare data - nsch 2016
Map_Data_Year = 2016 # must be integer
ASD_State_Subset = subset(ASD_State, Source == Map_Data_Source & Year == Map_Data_Year)
ASD_State_Subset$state = ASD_State_Subset$State
# Plot on map
p_map_nsch_2016 <- plot_usmap(data = ASD_State_Subset, values = Map_Data_Value,
                              color = "white", labels = FALSE, label_color = 'white') +
  scale_fill_continuous(na.value = "lightgrey", low = "lightblue1", high = "darkblue",
                        name = "Prevalence\nper 1,000\nChildren", limits = c(0, 40)) +
  labs(title = paste("Prevalence Estimates by Geographic Area",
                     '\n[ Measure :', Map_Data_Value, "] [ Source :", Map_Data_Source,
                     "] [ Year :", Map_Data_Year, "]"),
       subtitle = 'https://www.cdc.gov/ncbddd/autism') +
  theme(panel.background = element_rect(color = "white", fill = "white"),
        legend.position = "right")
p_map_nsch_2016
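The four map chunks above differ only in the year, so the subset-and-plot steps could be wrapped in a function. The sketch below is one possible way to do that, not the workshop's own method: `prepare_map_data()`, `plot_prevalence_map()` and `toy_state` are hypothetical names, the toy values are invented, and the plotting calls simply reuse the `usmap` and `ggplot2` arguments already shown above.

```r
# Hypothetical toy rows standing in for ASD_State (values invented for illustration only).
toy_state <- data.frame(
  State      = c("CA", "TX", "CA", "TX"),
  Source     = c("nsch", "nsch", "nsch", "addm"),
  Year       = c(2012, 2012, 2016, 2016),
  Prevalence = c(10, 12, 20, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep rows for one source and year, then add the
# lowercase `state` column that plot_usmap() matches on.
prepare_map_data <- function(df, source, year) {
  keep <- which(df$Source == source & df$Year == year)  # which() ignores NA comparisons
  out <- df[keep, , drop = FALSE]
  out$state <- out$State
  out
}

# Hypothetical helper: build one map with the same styling as the chunks above.
plot_prevalence_map <- function(df, source, year, value_col = "Prevalence") {
  map_data <- prepare_map_data(df, source, year)
  usmap::plot_usmap(data = map_data, values = value_col, color = "white") +
    ggplot2::scale_fill_continuous(na.value = "lightgrey", low = "lightblue1",
                                   high = "darkblue",
                                   name = "Prevalence\nper 1,000\nChildren",
                                   limits = c(0, 40)) +
    ggplot2::labs(title = paste("Prevalence Estimates by Geographic Area",
                                "\n[ Measure :", value_col, "] [ Source :", source,
                                "] [ Year :", year, "]"),
                  subtitle = "https://www.cdc.gov/ncbddd/autism") +
    ggplot2::theme(legend.position = "right")
}

# The data-preparation step can be tried on the toy rows without any packages.
prepare_map_data(toy_state, "nsch", 2012)

# With the real data and packages available, one map per survey year:
if (exists("ASD_State") &&
    requireNamespace("usmap", quietly = TRUE) &&
    requireNamespace("ggplot2", quietly = TRUE)) {
  years <- c(2004, 2008, 2012, 2016)
  maps_nsch <- lapply(years, function(y) plot_prevalence_map(ASD_State, "nsch", y))
  names(maps_nsch) <- paste0("p_map_nsch_", years)
}
```

Keeping the fill `limits` fixed inside the function means every year shares one colour scale, which is what makes the maps comparable side by side.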

# Dynamic map
p_dynamic <- p_map_nsch_2016 # [ Source: NSCH ] [ Year: 2016 ]
p_dynamic <- ggplotly(p_dynamic)
p_dynamic

Combine multiple plots to show in one page/screen:

# Adjust in-line plot size to M x N
options(repr.plot.width=8, repr.plot.height=6)
# ----------------------------------
# Combine multiple plots 
# ----------------------------------
if(!require(cowplot)){install.packages("cowplot")}
## Loading required package: cowplot
## 
## ********************************************************
## Note: As of version 1.0.0, cowplot does not change the
##   default ggplot2 theme anymore. To recover the previous
##   behavior, execute:
##   theme_set(theme_cowplot())
## ********************************************************
## 
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggthemes':
## 
##     theme_map
library('cowplot')
cowplot::plot_grid(
  p_map_nsch_2004,
  p_map_nsch_2008,
  p_map_nsch_2012,
  p_map_nsch_2016,
  nrow = 2)

Export current plot as image file:

# ----------------------------------
# Export current plot as image file
# ----------------------------------
# ggsave() saves the most recently displayed plot by default (here, the combined grid above).
ggsave("plot Map Prevalence Estimates by Geographic Area [NSCH] [2004-2016].png", 
       width = 60, height = 30, units = 'cm')

Workshop Submission

<h3>
    What to submit?
</h3>
<p>
    Choose one of the visualisations/charts below and use R to reproduce it cleanly.
</p>
<p>
    Optionally, enhance it with additional data dimensions to improve on the original chart.
</p>

https://www.cdc.gov/ncbddd/autism/data/index.html

# Write your code below and press Shift+Enter to execute 

Excellent! You have completed the workshop notebook!

Connect with the author:

This notebook was written by GU Zhan (Sam).

Sam is currently a lecturer at the Institute of Systems Science, National University of Singapore. He is devoted to pedagogy and andragogy, and is passionate about inspiring the next generation of artificial intelligence lovers and leaders.

Copyright © 2020 GU Zhan

This notebook and its source code are released under the terms of the MIT License.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.


Appendices

<h3>
Interactive workshops: < Learning R inside R > using swirl() (in R/RStudio)
</h3>

https://github.com/telescopeuser/S-SB-Workshop

<h3>
Correlation of Numeric Variables
</h3>
# ----------------------------------
# Correlation of Numeric Variables
# ----------------------------------
cor_df = select_if(ASD_State, is.numeric) # Select only numeric variables
cor_df = cor_df[, colSums(is.na(cor_df)) == 0] # Select variables without NA

# Compute correlation matrix for No-NA numeric variables:
cor_table = cor(cor_df)
cor_table
##                               Denominator Prevalence    Lower.CI    Upper.CI
## Denominator                    1.00000000 -0.1374662 -0.07863304 -0.17389486
## Prevalence                    -0.13746621  1.0000000  0.95813468  0.96568034
## Lower.CI                      -0.07863304  0.9581347  1.00000000  0.85132455
## Upper.CI                      -0.17389486  0.9656803  0.85132455  1.00000000
## Year                           0.02851671  0.6400295  0.67690938  0.56480277
## Numerator_ASD                  0.82429404  0.1121787  0.21429644  0.02005452
## Numerator_NonASD               0.99999025 -0.1392238 -0.08080949 -0.17516773
## Proportion                    -0.13735462  0.9999677  0.95851437  0.96524017
## Chi_Wilson_Corrected_Lower.CI -0.08734046  0.9761979  0.99597141  0.88837741
## Chi_Wilson_Corrected_Upper.CI -0.17380524  0.9798117  0.88384420  0.99561482
##                                     Year Numerator_ASD Numerator_NonASD
## Denominator                   0.02851671    0.82429404       0.99999025
## Prevalence                    0.64002950    0.11217865      -0.13922381
## Lower.CI                      0.67690938    0.21429644      -0.08080949
## Upper.CI                      0.56480277    0.02005452      -0.17516773
## Year                          1.00000000    0.29628163       0.02638864
## Numerator_ASD                 0.29628163    1.00000000       0.82178563
## Numerator_NonASD              0.02638864    0.82178563       1.00000000
## Proportion                    0.64020778    0.11251687      -0.13911415
## Chi_Wilson_Corrected_Lower.CI 0.67167964    0.19523745      -0.08942415
## Chi_Wilson_Corrected_Upper.CI 0.58775086    0.03675270      -0.17520779
##                               Proportion Chi_Wilson_Corrected_Lower.CI
## Denominator                   -0.1373546                   -0.08734046
## Prevalence                     0.9999677                    0.97619788
## Lower.CI                       0.9585144                    0.99597141
## Upper.CI                       0.9652402                    0.88837741
## Year                           0.6402078                    0.67167964
## Numerator_ASD                  0.1125169                    0.19523745
## Numerator_NonASD              -0.1391141                   -0.08942415
## Proportion                     1.0000000                    0.97646889
## Chi_Wilson_Corrected_Lower.CI  0.9764689                    1.00000000
## Chi_Wilson_Corrected_Upper.CI  0.9796180                    0.91344122
##                               Chi_Wilson_Corrected_Upper.CI
## Denominator                                      -0.1738052
## Prevalence                                        0.9798117
## Lower.CI                                          0.8838442
## Upper.CI                                          0.9956148
## Year                                              0.5877509
## Numerator_ASD                                     0.0367527
## Numerator_NonASD                                 -0.1752078
## Proportion                                        0.9796180
## Chi_Wilson_Corrected_Lower.CI                     0.9134412
## Chi_Wilson_Corrected_Upper.CI                     1.0000000
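In the matrix above, `Prevalence` and `Proportion` correlate at 0.9999677 and `Denominator` and `Numerator_NonASD` at 0.99999025, so each pair carries nearly the same information. The code above also drops every column that contains an `NA` before calling `cor()`. A possible alternative, sketched here on an invented toy data frame (`toy` is a hypothetical name), is to keep such columns and let `cor()` use pairwise complete observations through its `use` argument.

```r
# Hypothetical toy data (values invented for illustration); z has one missing value.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 6, 8, 10),
  z = c(5, NA, 3, 2, 1)
)

# Approach used above: drop columns containing NA, then correlate what remains.
cor_complete_cols <- cor(toy[, colSums(is.na(toy)) == 0])
dim(cor_complete_cols)   # z is excluded, leaving a 2 x 2 matrix

# Alternative: keep all columns and use, for each pair, the rows where both are present.
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")
dim(cor_pairwise)        # z is retained, giving a 3 x 3 matrix
```

With pairwise deletion each coefficient may be based on a different set of rows, so the two approaches trade column coverage against a consistent sample.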
# ----------------------------------
# Visualise Correlation Matrix
# ----------------------------------

if(!require(corrplot)){install.packages("corrplot")}
## Loading required package: corrplot
## corrplot 0.84 loaded
library('corrplot')
# Sort on decreasing correlations with Prevalence
cor_table_sorted <- as.matrix(sort(cor_table[,'Prevalence'], decreasing = TRUE))
#
cor_table_sorted
##                                     [,1]
## Prevalence                     1.0000000
## Proportion                     0.9999677
## Chi_Wilson_Corrected_Upper.CI  0.9798117
## Chi_Wilson_Corrected_Lower.CI  0.9761979
## Upper.CI                       0.9656803
## Lower.CI                       0.9581347
## Year                           0.6400295
## Numerator_ASD                  0.1121787
## Denominator                   -0.1374662
## Numerator_NonASD              -0.1392238
# Select correlated variables based on threshold:
#cor_var_high <- names(which(apply(cor_table_sorted, 1, function(x) abs(x)>0.25)))
cor_var_high <- names(which(apply(cor_table_sorted, 1, function(x) abs(x)>0.05)))
#
cor_var_high
##  [1] "Prevalence"                    "Proportion"                   
##  [3] "Chi_Wilson_Corrected_Upper.CI" "Chi_Wilson_Corrected_Lower.CI"
##  [5] "Upper.CI"                      "Lower.CI"                     
##  [7] "Year"                          "Numerator_ASD"                
##  [9] "Denominator"                   "Numerator_NonASD"
# Visualise:
cor_table_plot <- cor_table[cor_var_high, cor_var_high]
# cor_table_plot
#
corrplot(cor_table_plot, tl.col="black", tl.pos = "lt")
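The effect of the threshold can be checked directly on the sorted correlations printed above. The sketch below copies those printed values into a named vector and applies the same absolute-value filter; `cor_with_prevalence` and `select_by_threshold()` are hypothetical names used only for this illustration.

```r
# Correlations with Prevalence, copied from the sorted output printed above.
cor_with_prevalence <- c(
  Prevalence                    =  1.0000000,
  Proportion                    =  0.9999677,
  Chi_Wilson_Corrected_Upper.CI =  0.9798117,
  Chi_Wilson_Corrected_Lower.CI =  0.9761979,
  Upper.CI                      =  0.9656803,
  Lower.CI                      =  0.9581347,
  Year                          =  0.6400295,
  Numerator_ASD                 =  0.1121787,
  Denominator                   = -0.1374662,
  Numerator_NonASD              = -0.1392238
)

# Hypothetical helper: names of variables whose absolute correlation exceeds the threshold.
select_by_threshold <- function(r, threshold) {
  names(r)[abs(r) > threshold]
}

length(select_by_threshold(cor_with_prevalence, 0.05))  # 10: every variable is kept
select_by_threshold(cor_with_prevalence, 0.25)          # 7 variables remain
```

At 0.05 no variable is filtered out, while the commented-out 0.25 threshold would drop `Numerator_ASD`, `Denominator` and `Numerator_NonASD` from the plot.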
